Decision Support Systems

(1)

Advanced DB

CHAPTER 18 DATA ANALYSIS & MINING

(2)

Chapter 18: Data Analysis and Mining

Decision Support Systems

Data Analysis and OLAP

Data Warehousing

Data Mining

(3)

Online Transaction Processing

Transaction processing activities are needed to capture and process data in ol ed in transactions

process data involved in transactions

Online transaction processing systems

à Real time systems that capt re and process transactions immediately à Real-time systems that capture and process transactions immediately

à Can help provide superior service to customers and other trading partners

An integral part of the every day business of an enterprise An integral part of the every day business of an enterprise

à Order processing system of an electronics distributor à Banking system of a bankg

à Order control and billing system of a hospital

à Switches and billing system of a telephone company

(4)

Decision Support Systems

Decision-support systems are used to make business decisions

à What items to stock?What items to stock?

à What insurance premium to change?

à To whom to send advertisements?

Examples of data used for making decisions

à Retail sales transaction details

à Customer profiles (income age gender etc )Customer profiles (income, age, gender, etc.)

Identifying patterns in customer behavior and Using the patterns to make business decisions

à A Sudden spurt in purchases of flannel shirts

à Most of small sports cars are bought by young woman with more than $50,000 annual salaryy

(5)

Multidimensional View

Data that can be modeled as dimension attributes and measure attributes

attributes

sales(item-name, color, size, number)

Measure attributes

à example: number

Sales

Seoul

example: number à Measure some value à Can be aggregated upon

Item

1 2 3

A B

C Busan

Kwangju

Dimension attributes

à example: item name color size

Month

1 2 3

à example: item-name, color, size

à Define the dimensions on which measure attributes, and summaries of measure attributes

(6)

Online Analytical Processing (OLAP)

Interactive analysis of summary information

Statistical analysis often requires grouping on multiple attributes

Although complex statistical analysis is best left to statistics

packages, database should support simple, commonly used, forms of data analysis

Several SQL extensions have been developed to support OLAP

tools

(7)

Cross Tabulation of sales by item-name and color

An example of a cross-tabulation (cross-tab)

à also referred to as a pivot-table.

à Values for one of the dimension attributes form the row headers

à Values for another dimension attribute form the column headers

à Other dimension attributes are listed on top

à Values in individual cells are (aggregates of) the values of the dimension attributes that specify the cell.

dimension attributes that specify the cell.

(8)

Relational Representation of Cross-tabs

Cross-tabs can be represented as relations

relations

We use the value all is used to represent aggregates

Th SQL 1999 t d d t ll

The SQL:1999 standard actually uses null values in place of all

despite confusion with regular null values

(9)

Data Cube

A data cube is a multidimensional generalization of a cross-tab

C h di i h 3 b l

Can have n dimensions; we show 3 below

Cross-tabs can be used as views on a data cube

(10)

Online Analytical Processing Operations

Pivoting: changing the dimensions used in a cross-tab is called

Slicing: creating a cross-tab for fixed values only

à Sometimes called dicing, particularly when values for multiple dimensions are fixed

are fixed.

Rollup: moving from finer-granularity data to a coarser granularity

g y

Drill down: The opposite operation - that of moving from

coarser-granularity data to finer-granularity data g y g y

(11)

Hierarchies on Dimensions

Hierarchy on dimension attributes: lets dimensions to be viewed at different le els of detail

at different levels of detail

à E.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or yeary q y

(12)

Cross Tabulation With Hierarchy

Cross-tabs can be easily extended to deal with hierarchies

C d ill d ll hi h

à Can drill down or roll up on a hierarchy

(13)

OLAP Implementation

MOLAP (multidimensional OLAP)

à Multidimensional arrays in memory to store data cubes (earliest OLAPMultidimensional arrays in memory to store data cubes (earliest OLAP systems)

ROLAP (relational OLAP)

à Data stored in relational database à Data stored in relational database

HOLAP (hybrid OLAP)

à Store some summaries in memory

à Store the base data and other summaries in rdb

Many OLAP systems are implemented as client-server systems

(14)

OLAP Implementation (Cont.)

Precompute all possible aggregates in order to provide online response

response

à 2ⁿ combinations of group by

Precompute some aggregates, and compute others on demand p gg g p from one of the precomputed aggregates

à Can compute aggregate on (item-name, color) from an aggregate on (item- name color size)

name, color, size)

Several optimizations available for computing multiple aggregates

à Can compute aggregate on (item-name, color) from an aggregate on p gg g gg g (item-name, color, size)

à Can compute aggregates on (item-name, color, size), (item-name, color), and (item-name) using a single sorting of the base data

( ) g g g

(15)

Extended Aggregation in SQL:1999

The cube operation computes union of group by’s on every subset of the specified attributes

subset of the specified attributes

E.g. consider the query

select item-name, color, size, sum(number) select item name, color, size, sum(number) from sales

group by cube(item-name, color, size)

computes the union of eight different groupings of the sales relation:

{ (it l i ) (it l ) (it i )

{ (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }

where ( ) denotes an empty group by list where ( ) denotes an empty group by list.

For each grouping, the result contains the null value for attributes not present in the grouping.

not present in the grouping.

(16)

Extended Aggregation (Cont.)

Relational representation of cross-tab of figure 18.2 can be computed by

computed by

select item-name color sum(number) select item name, color, sum(number) from sales

group by cube(item-name, color)

(17)

Data Warehouse - Motivation

Decision Support Problems in conventional systems (W. H. Inmon)

à Credibility of data: no time basis, algorithmic difference, no common sourcey , g ,

à Productivity: multiple sources in multiple environment, insufficient information, repeated work

à Data into information: data from multiple applications, consistency, past historyp pp y p y

Conventional DSS Cycle

Sales

request

Item

Month 1 2 3

A B C

executive modifications IS / IT

executive

(18)

Solution 1: Rebuild System

Pros

A & i i d d d l

à A new & consistent integrated data model

à Opportunity adopt new HW, SW, & methodology

Cons

à Large long-term project

à Change in management/organization/operations during projectg g / g / p g p j à Transition risks

à Still operations (OLTP) oriented

DSS is second priority at best

(19)

Solution 2: DW-based Restructuring

DSS DSS

EIS DW

Extract Transform Load

Legacy OLTP

EIS

g y

Reports Data Processing

(Transaction) Information Processing

(20)

Data Warehouse – Features

A repository (or archive) of information gathers from multiple sources stored under a unified schema at a single site

sources, stored under a unified schema, at a single site

à Once gathered, the data are stored for a long time, permitting access to historical data

à Provide the user a single consolidated interface to data, making decision- support queries easier to write

à Online transaction-processing systems are not affected by the decision-Online transaction processing systems are not affected by the decision support workload

Key Features

à Subject oriented à Integrated

à Time variant historical à Time-variant, historical

(21)

Data Warehouse – Features

Subject oriented

D i OLTP d i d/ d/ d i h

à Data in a OLTP system are designed/managed/used with a process/function-oriented view

à Data in a DW are designed/managed/used with a subject-oriented viewg g j

Focus of a DSS is in the subjects of an organization (not the processes)

Not all data from OLTP need to be moved to the DW

Operational Data Warehouse

Loan proc.

Deposit proc.

customer

account

Credit review operations

(22)

Data Warehouse – Features

Integrated

N bl d l

à Names: table names and column names à Units, Values and codes

à Aggregation and consolidationAggregation and consolidation

à A consistent source of information

balance

amount balance

A l A l th cm

amount

Curr. amount Appl. A length:cm

Appl. B length:inch

cm

transformation

(23)

Data Warehouse – Features

Time Variant

OLTP h f h d i ld

à OLTP: represents the current state of the domain world à DW: need to log the history of a subject

Monthly call charges for K. Kim for last yeary g y

Weekly sales for each brand (or item) à Time element must be present in the key

K Kim ^{K Kim} append

K. Kim addr:Busan K. Kim

addr:Seoul

…

append K. Kim

addr:Busan

…

K. Kim addr:Seoul

… date:2008-4-5 update

Moved from Seoul to Busan on

2008-4-5

…

date:2002-7-15 OLTP

DW K. Kim

(24)

Data Warehouse – Features

Nonvolatile

Li l ( ) d

à Little (or no) updates

à Most operations are new inserts (load) à Normalization is not importantNormalization is not important

OLTP

update OLTP DW

insert

update

(25)

Data Modeling

DB Design for OLTP

à Normalization: minimize update anomalies by eliminating redundancy à Normalization: minimize update anomalies by eliminating redundancy à Transactions usually access (relatively) small number of records

à Node – table / Link – join path

à Ad hoc join paths – not easy to predict access patterns

DB Design for Data warehouse

à Historical and summary oriented à Analytical “views”

à Non-volatile

à Dimensional model

Fact tables and dimensional tablesFact tables and dimensional tables

(26)

Data Modeling – OLTP

Shi Ship

Type Shipper Ship To Product

District Order Contact Product

District Credit

Order Item

Contact Loc

Product Line

Sales O d

Cust

Loc Product

Contrct

Order

Contrct

Type Customer

Loc

Contact

Product Group

Sales Rep

Sales District

Sales Division Sales

Region

(27)

Multi-dimensional Modeling – Fact table

Store all transactions for factual data about a business

Have a multi-part key (composite key)

Each element of the key is a FK to a single dimension table

The remaining fields in fact table are known as Facts

Facts are almost numeric – target of aggregation (count, sum, average, …)

much larger than dimension tables

(28)

Multi-dimensional Modeling – Dimension table

Dimensions are attribute of facts

Store hierarchical relationships between different grouping or characteristics of those transactions

Di i hi h

Dimension hierarchy

Time : Year, Quarter, Week, Day

Product : Group, Subgroup, Department, Class, Itemp, g p, p , ,

Territory : Division, Country, Region, Distinct à Unit of drill-down & roll-up

Have single key that is joined to a Fact table

The remaining fields are called attributes

(29)

Hierarchies on Dimensions

The different levels of detail for an attribute can be organized into a hierarch

into a hierarchy

(30)

Data Modeling Example

Requirements

à A convenience store chainA convenience store chain

à Sale results by date, item, and store

à Sales by area, month, item or brand

à Performance of promotions

Fact table

à Fact event: sale transactionFact event: sale transaction

à Atomic unit: item, day, store, …

à Fact table attributes: sales quantity, amount, cost

Dimension tables

à time: key, date, month, quarter, year, day-of-week, holiday-flag, season

à item: key name category package-type size weight coloritem: key, name, category, package type, size, weight, color

(31)

Data Modeling – Multi-dim. Star Schema

item#

item name timeKey

date

Item dimension Time dimension

item name item class

brand package

date dayofWeek

month quarter timeKey

item#

store#

i # size

weight color

q year holiday season promotion#

items_sold amount_sold

cost promotion#

promo_name store#

Promotion dimension Store dimension

cost p

promo_type discount%

displays promo cost

store_name address

city zipcode

Fact table

promo_cost promo_dates

zipcode area manager

Q: The top selling brands on holidays in Q1?

(32)

Example ^Queries

For the past 3 months, compare holiday and workday sales for each brand

each brand

à Time: restriction (month) & grouping (holiday) à Item: grouping (brand)Item: grouping (brand)

à Measure: sum(amount_sold)

Last quarter profitability per promotion type q p y p p yp

à Time: restriction (quarter)

à Promotion : grouping (promo_type)

à Measure : (sum(amount_sold)-sum(cost))/sum(promo_cost)

(33)

Data Warehouse Tuning

Characteristics

à Data Wareho ses are bigger terab tes petab tes à Data Warehouses are bigger – terabytes~petabytes à They change less often

à Queries use aggregation or complex WHERE clausesgg g p

Implications

à Scanning all data is very very slow à We can keep redundant data

à Index is important

T i t h i

Tuning techniques

à Bitmap index à Materialized viewate a ed v ew à Approximate answer

(34)

Bitmaps

Order of magnitude improvement

R Order of magnitude improvement

compared to scan.

Bitmaps are best suited for multiple conditions on several attributes, each

B Y

color

conditions on several attributes, each having a low selectivity.

Y N

holiday_flag

10 12 14

ec) bitmap - two

attributes

10 12

sec)

4 6 8 10

ponse Time (se attributes

r-tre e

4 6 8

ghput (Queries/s

line ar scan

bitmap p

bitmap

(35)

Materialized Views

highly

i d monthly sales per item-type

summarized monthly sales per item type

for past 10 years

M

lightly

summarized

weekly sales for past 5 years

Metadata

sales detail of current year Current

Detail ETL from

OLTP system

sales detail of current year (current detail)

sales detail for past 00 years (older detail)

OLTP system

(36)

Materialized Views

Graph:

14 p

à Oracle9i on Linux

à Total sale by vendor is materialized

Trade-off between query speed-up and view

6 8 10 12 14

Time (sec)

q y p p

maintenance:

à The impact of incremental maintenance on performance is significant.

R b ild i hi d

0 2 4 6

M aterialized View No M aterialized View

Response

à Rebuild maintenance achieves a good throughput.

à A static data warehouse offers a good trade-off.

1200 1600

sec)

M aterialized View No M aterialized View

400 800 1200

t (statements/

(37)

Approximations

Ideal application parameters: broad, dynamic, large data

Rippling

à When computing average of all salaries

à If average value changes little with increasing size, then it is probably accurate for all data (within some error bound)

( )

à Can be extended to multiple joins

(38)

Approximations (cont.)

Good approximation for query Q1 on

T PC-H Query Q1: lineitem Good approximation for query Q1 on

lineitem

The aggregated values obtained on a query with a 6-way join are significantly different

y

20 40 60

mated e Values Base e Value) 1% sample 10% sample

with a 6 way join are significantly different from the actual values -- for some

applications may still be good enough.

-40 -20 0

1 2 3 4 5 6 7 8

Aggregate values Approxi Aggregate (% of B Aggregate

T PC-H Query Q5: 6 way join

60

gg g

3 3.5 4

ec) base re lations

0 20 40 60

proximated gated values of Base ate Value)

0.5 1 1.5 2 2.5 3

sponse Time (se

1 % sample 10% sample

(39)

Other Issues

When and how to gather data

à Source-driven architecture

à Destination-driven architecture

à Data warehouses typically have slightly out-of-date data

What schema to use

à Usually an integration of multiple sources

à Must decide on a unified (consolidated) view of the data

Data cleansingg

à The process of correcting and preprocessing data

à Merge-purge operation

à Householdingg

How to propagate updates

à View-maintenance problem

What data to summarizeWhat data to summarize

à Maintain only summary data obtained by aggregation on a relation

à How do we decide on the unit granularity?

(40)

Data Mining

Data mining is the process of semi-automatically analyzing large databases to find sef l patterns

databases to find useful patterns

Knowledge Discovery in (Large) Database (KDD)

à "knowledge extraction" "data dredging" "data archaeology"

à knowledge extraction , data dredging , data archaeology ,

"data/pattern analysis“, etc.

Prediction based on past history p y

à Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history

à Predict if a pattern of phone calling card usage is likely to be fraudulent

(41)

Data Mining (Cont.)

Some examples of prediction mechanisms:

Cl ifi i

Classification

à Given a new item whose class is unknown, predict to which class it belongs

Regression formulae

à Given a set of mappings for an unknown function, predict the function result for a new parameter value

Associations

à Find books that are often bought by “similar” customers. If a new such customer buys one such book suggest the others too

customer buys one such book, suggest the others too.

à Associations may be used as a first step in detecting causation

à E.g. association between exposure to chemical X and cancer,

Clusters

à E.g. typhoid cases were clustered in an area surrounding a contaminated well

à Detection of clusters remains important in detecting epidemics

(42)

Classification Rules

Classification rules help assign new objects to classes.

E i bil i li h ld h h b

à E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?

Classification rules for above example could use a variety of data Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.

à ∀ person P, P.degree = masters and P.income > 75,000

⇒ P.credit = excellent à ∀ person P, P.degree = bachelors and

(Pincome ≥ 25 000 and Pincome ≤ 75 000) (P.income ≥ 25,000 and P.income ≤ 75,000)

⇒ P.credit = good

Rules are not necessarily exact: there may be some y y

(43)

Decision Tree

Classification rules can be shown compactly as a decision tree.

(44)

Other Types of Classifiers

Neural net classifiers are studied in artificial intelligence and are not covered here

not covered here

Bayesian classifiers use Bayes theorem, which says p (c

_j

| d ) = p (d | c

_j

) p (c

_j

)

p (c

_j

| d ) p (d | c

_j

) p (c

_j

) p ( d )

where

p (c

_j

| d ) = probability of instance d being in class c

_j

,

p (d | c

_j

) = probability of generating instance d given class c

_j

, p (c

_j

) = probability of occurrence of class c

_j

p (d ) = probability of instance d occurring

(45)

Naïve Bayesian Classifiers

Bayesian classifiers require

i f (d | ) à computation of p (d | c_j) à precomputation of p (c_j )

à p (d ) can be ignored since it is the same for all classesp (d ) can be ignored since it is the same for all classes

To simplify the task, assume attributes have independent distributions;;

p (d | c

_j

) = p (d

₁

| c

_j

) p (d*

₂

| c

_j

) ….* (p (d*

_n

| c

_j

)

à Each of the p (d_i| c| _j_j) can be estimated from a histogram on dg _ivalues for each class c_j

à the histogram is computed from the training instances

(46)

Association Rules

Examples

à Someone who buys bread is quite likely also to buy milk à Someone who buys bread is quite likely also to buy milk

à A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Association rules:

bread ⇒ milk

DB-Concepts, OS-Concepts ⇒ Networks

à Left hand side: antecedent, right hand side: consequent

à An association rule must have an associated population; the population p p ; p p consists of a set of instances

E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the populationp p

(47)

Association Rules (Cont.)

Rules have an associated support, as well as an associated confidence

confidence.

Support

à P(X ∩ Y) ho often does this p ttern o r à P(X ∩ Y): how often does this pattern occur

à what fraction of the population satisfies both the antecedent and the consequent of the rule

à E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule is milk ⇒ screwdrivers is low.

C fid

Confidence

à P(Y | X): how prevalent is the pattern given X

à how often the consequent is true when the antecedent is true à how often the consequent is true when the antecedent is true.

à E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

(48)

Finding Association Rules

To discover association rule,

à (i( ₁₁, i, ₂₂, …, i, , _n_n) => i) ₀₀

à Find large itemsets

large itemset S: sets of items with sufficient support à Find rule (S – s)⇒s for every subset s⊂S( ) y

where (S – s)⇒s has sufficient confidence

B1 = {m, c, b} B2 = {m, p, j}

+ { , , } { , p, j}

B3 = {m, b} B4 = {c, j}

B5 = {m, p, b} B6 = {m, c, b, j}

B7 = {c, b, j} B8 = {b, c}

–

– +

{ , , j} { , }

An association rule: {m, b} -> c.

à Confidence = 2/4 = 50%

à Support = 2/8 = 25%

(49)

Finding Association Rules ^(cont.)

Sample Database

1. A supermarket sells 10,000 items

2. The average basket has 10 items

3. There are 10,000,000 baskets in the database

Table structure

à Baskets_long(BID, item1, item2, …, item10)

à Baskets(BID, item) ( , )

10 X 10,000,000 = 100,000,000 records

The support threshold is 1% of the baskets pp

à i.e., X=>Y is interesting only if X, Y are purchased together

at least 1% of the time (100,000 baskets) (

(50)

Frequent Pairs in SQL

SELECT b1.item, b2.item

Look for two

Basket tuples

FROM Baskets b1, Baskets b2

Basket tuples with the same

basket and different items.

WHERE b1.BID = b2.BID AND b1.item < b2.item

First item must precede second,

so we don’t

AND b1.item b2.item GROUP BY b1.item, b2.item

count the same pair twice.

HAVING COUNT(*) >= 100000;

Create a group for each pair of items each pair of items

(51)

A-Priori Algorithm

The naïve way (query in previous page)

à Join complexity: 100,000,000 X 100,000,000

à Each basket contributes C(10, 2) = 45 pairs

à Result of join has 45 X 10,000,000 records

à Need to perform Group by and Having

The A-Priori algorithm

à Key idea: If item i does not appear in s baskets, then no pair y pp , p including i can appear in s baskets

à So, do not consider items that appear in less than s baskets

(52)

A-Priori Algorithm ^(cont.)

The new query

INSERT INTO Baskets1(basket item) INSERT INTO Baskets1(basket, item) SELECT * FROM Baskets

WHERE item IN ( SELECT item FROM Baskets GROUP BY item

HAVING COUNT(*) >= 100000 );

Perform previous query with Baskets1

Join computation

à Computing Baskets1 is cheap since there is no join

Or could have built index from beginning à At most 1 000 items can be frequentAt most 1,000 items can be frequent

(53)

Other Types of Mining

Text mining: application of data mining to textual documents

l W b fi d l d

à cluster Web pages to find related pages

à cluster pages a user has visited to organize their visit history à classify Web pages automatically into a Web directoryclassify Web pages automatically into a Web directory

Data visualization systems help users examine large volumes of data and detect patterns visually p y

à Can visually encode large amounts of information on a single screen à Humans are very good a detecting visual patterns

(54)

 Decision Support Systems

CHAPTER 18

DATA ANALYSIS & MINING

Chapter 18: Data Analysis and Mining

 Decision Support Systems

 Data Analysis and OLAP

 Data Warehousing

 Data Mining

Online Transaction Processing

 Transaction processing activities are needed to capture and process data in ol ed in transactions

process data involved in transactions

 Online transaction processing systems

 An integral part of the every day business of an enterprise An integral part of the every day business of an enterprise

Decision Support Systems

Multidimensional View

 Data that can be modeled as dimension attributes and measure attributes

attributes

 Measure attributes

 Dimension attributes

Online Analytical Processing (OLAP)

 Interactive analysis of summary information

 Statistical analysis often requires grouping on multiple attributes

 Although complex statistical analysis is best left to statistics

packages, database should support simple, commonly used, forms of data analysis

 Several SQL extensions have been developed to support OLAP

tools

Cross Tabulation of sales by item-name and color

An example of a cross-tabulation (cross-tab)

Relational Representation of Cross-tabs

Data Cube

 A data cube is a multidimensional generalization of a cross-tab

Online Analytical Processing Operations

 Pivoting: changing the dimensions used in a cross-tab is called

 Slicing: creating a cross-tab for fixed values only

 Rollup: moving from finer-granularity data to a coarser granularity

g y

 Drill down: The opposite operation - that of moving from

coarser-granularity data to finer-granularity data g y g y

Hierarchies on Dimensions

 Hierarchy on dimension attributes: lets dimensions to be viewed at different le els of detail

at different levels of detail

Cross Tabulation With Hierarchy

 Cross-tabs can be easily extended to deal with hierarchies

OLAP Implementation

 MOLAP (multidimensional OLAP)

 ROLAP (relational OLAP)

 HOLAP (hybrid OLAP)

 Many OLAP systems are implemented as client-server systems

OLAP Implementation (Cont.)

 Precompute all possible aggregates in order to provide online response

response

 Precompute some aggregates, and compute others on demand p gg g p from one of the precomputed aggregates

 Several optimizations available for computing multiple aggregates

Extended Aggregation in SQL:1999

 The cube operation computes union of group by’s on every subset of the specified attributes

subset of the specified attributes

 E.g. consider the query

select item-name, color, size, sum(number) select item name, color, size, sum(number) from sales

group by cube(item-name, color, size)

computes the union of eight different groupings of the sales relation:

{ (it l i ) (it l ) (it i )

{ (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }

where ( ) denotes an empty group by list where ( ) denotes an empty group by list.

 For each grouping, the result contains the null value for attributes not present in the grouping.

not present in the grouping.

Extended Aggregation (Cont.)

 Relational representation of cross-tab of figure 18.2 can be computed by

computed by

select item-name color sum(number) select item name, color, sum(number) from sales

group by cube(item-name, color)

Data Warehouse - Motivation

Solution 1: Rebuild System

 Pros

 Cons

 Cons

Solution 2: DW-based Restructuring

Data Warehouse – Features

 A repository (or archive) of information gathers from multiple sources stored under a unified schema at a single site

sources, stored under a unified schema, at a single site

 Key Features

Decision Support Systems

Decision Support Systems

Data Analysis and OLAP

Data Warehousing

Data Mining

Transaction processing activities are needed to capture and process data in ol ed in transactions

Online transaction processing systems

An integral part of the every day business of an enterprise An integral part of the every day business of an enterprise

Data that can be modeled as dimension attributes and measure attributes

Measure attributes

Dimension attributes

Interactive analysis of summary information

Statistical analysis often requires grouping on multiple attributes

Although complex statistical analysis is best left to statistics

Several SQL extensions have been developed to support OLAP

A data cube is a multidimensional generalization of a cross-tab

Pivoting: changing the dimensions used in a cross-tab is called

Slicing: creating a cross-tab for fixed values only

Rollup: moving from finer-granularity data to a coarser granularity

Drill down: The opposite operation - that of moving from

Hierarchy on dimension attributes: lets dimensions to be viewed at different le els of detail

Cross-tabs can be easily extended to deal with hierarchies

MOLAP (multidimensional OLAP)

ROLAP (relational OLAP)

HOLAP (hybrid OLAP)

Many OLAP systems are implemented as client-server systems

Precompute all possible aggregates in order to provide online response

Precompute some aggregates, and compute others on demand p gg g p from one of the precomputed aggregates

Several optimizations available for computing multiple aggregates

The cube operation computes union of group by’s on every subset of the specified attributes

E.g. consider the query

For each grouping, the result contains the null value for attributes not present in the grouping.

Relational representation of cross-tab of figure 18.2 can be computed by

Pros

Cons

Cons

A repository (or archive) of information gathers from multiple sources stored under a unified schema at a single site

Key Features

Subject oriented

Integrated

Time Variant

Nonvolatile

DB Design for OLTP

DB Design for Data warehouse

DB Design for Data warehouse

Store all transactions for factual data about a business

Have a multi-part key (composite key)

Each element of the key is a FK to a single dimension table

The remaining fields in fact table are known as Facts

Facts are almost numeric – target of aggregation (count, sum, average, …)

much larger than dimension tables

Dimensions are attribute of facts

Store hierarchical relationships between different grouping or characteristics of those transactions

Dimension hierarchy

Have single key that is joined to a Fact table

The remaining fields are called attributes

The different levels of detail for an attribute can be organized into a hierarch

Example ^Queries

For the past 3 months, compare holiday and workday sales for each brand

Last quarter profitability per promotion type q p y p p yp

Characteristics

Implications

T i t h i

Tuning techniques

Data mining is the process of semi-automatically analyzing large databases to find sef l patterns

Knowledge Discovery in (Large) Database (KDD)

Prediction based on past history p y

Classification rules help assign new objects to classes.

Classification rules for above example could use a variety of data Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.

Rules are not necessarily exact: there may be some y y

Classification rules can be shown compactly as a decision tree.

Neural net classifiers are studied in artificial intelligence and are not covered here

Bayesian classifiers use Bayes theorem, which says p (c