CHAPTER 13

(1)

Intro to DB

CHAPTER 13 QUERY PROCESSING

(2)

Chapter 13: Query Processing

Overview

Measures of Query Cost

Selection Operation

Sorting

Join Operation

Other Operations

Evaluation of Expressions

(3)

Basic Steps in Query Processing

1. Parsing and translation 2. Optimization

3. Evaluation

(4)

Basic Steps

Parsing and translation

l h i i l f ( l i l l b )

à translate the query into an internal form (eg., relational algebra) à Parser checks syntax, verifies relations

Evaluation

à The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

More than one way to evaluate a query

à equivalence of expressions:

SELECT balance FROM account WHERE balance>2500

Π_balance(σbalance>2500(account)) OR σbalance>2500(Π_balance(account))

à Different strategies and algorithms

σ (account) :

sequential scan OR index scan on balance

σ (account) :

sequential scan OR index scan on balance

(5)

Query Plan

Evaluation primitive

à a relational algebra expression annotated with instructions on how to evaluate ita relational algebra expression annotated with instructions on how to evaluate it

σbalance>2500(account): use index 1

σbalance>2500(account): use table scan

Query evaluation (execution) plan

à a sequence of primitive operations that can be used to evaluate a query

Example:

à SELECT balance FROM account WHERE balance>2500

Π _balance σbalance>2500: seq. scan

Π

σ balance>2500: use index 1 Π _balance

account account

(6)

Query Optimization

Equivalence of Expressions

Gi DB h S Q S i i l h Q’

Given a DB schema S, a query Q on S is equivalent to another query Q’ _on S, if the answer sets of Q and Q’ are the same in any instances of the DB.

Π (σ (customer ⋈ depositor ⋈ branch)) vs Πb_name, asset(σc_city=“PC”(customer ⋈ depositor ⋈ branch)) vs

Πb_name, asset((σc_city=“PC”(customer)) ⋈ depositor ⋈ branch)

Query optimization is the process of selecting the most efficient query

l ti pl f i

evaluation plan for a given query

à Most efficient? How do we know?

(7)

Measures of Query Cost

Cost is generally measured as total elapsed time for answering q er

query

à Many factors contribute to time cost

disk accesses, CPU, or even network communication, ,

In a centralized system, disk access is the predominant cost

à average-seek-cost, average-block-read-cost, average-block-write-costg g g

For simplicity we use number of block transfers from/to disk as the cost measure

à ignore the difference in cost between sequential and random I/O à Ignore the difference in cost between disk reads and disk writes

i CPU

à ignore CPU costs

à ignore writing final output to disk

Real systems take all of these measures into account

(8)

Selection Operation

File scan

à locate and retrieve records that fulfill a selection condition à locate and retrieve records that fulfill a selection condition à Full file scan: retrieve all records of a file (relation)

A1: linear search

à Scan each file block and test all records on the selection condition

b

à Cost estimate (number of disk blocks scanned) =

b

_r

# of blocks containing records from relation r à If selection is on a key attribute, cost = (If selection is on a key attribute, cost (

b b

_r /2)/2)

stop on finding record

à Linear search can be applied regardless of

l i di i

selection condition or

ordering of records in the file, or

availability of indices

(9)

Selection Operation (Cont.)

A2: binary search

A li bl if l i i li i h ib hi h

à Applicable if selection is an equality comparison on the attribute on which file is ordered

à Assume that the blocks of a relation are stored contiguously g y à Cost estimate (number of disk blocks to be scanned):

⎡log₂(b_r)⎤ — cost of locating the first tuple by a binary search on the blocks

Pl b f bl k i i d h i f l i di i

Plus number of blocks containing records that satisfy selection condition

A3: primary index on candidate key, equality A3: primary index on candidate key, equality

à Retrieve a single record that satisfies the corresponding equality condition à Cost = HT_i + 1

(10)

Selection Operation (Cont.)

A4: primary index on nonkey, equality

R i l i l d

à Retrieve multiple records

à Records will be on consecutive blocks

à Cost = HTCost HT_i_i + number of blocks containing retrieved records number of blocks containing retrieved records

A5: equality on search-key of secondary index

à Retrieve a single record if the search-key is a candidate key

Cost = HT_i + 1

R i l i l d if h k i did k

à Retrieve multiple records if search-key is not a candidate key

Cost = HT_i + number of records retrieved

Can be very expensive!y p

each record may be on a different block

one block access for each retrieved record

(11)

Selections Involving Comparisons

Can implement selections of the form σ_A≤V(r) or σ_{A ≥ V}(r) by using

à a linear file scan or binary search,y ,

à or by using indices in the following ways:

A6: primary index comparison

A6: primary index, comparison

(Relation is sorted on A)

For σ_A_≥_V(r) use index to find first tuple ≥ v and scan relation sequentially from there

from there

For σ_A≤V(r) just scan relation sequentially till first tuple > v; do not use index

A7: secondary index comparison

A7: secondary index, comparison

For σ_A_≥_V(r) use index to find first index entry ≥ v and scan index sequentially from there, to find pointers to records.

For σ_A≤V_A≤V(r) just scan leaf pages of index finding pointers to records, till first ( ) j p g d d g p d , entry > v

In either case,

Retrieving records that are pointed to requires an I/O for each record

Linear file scan may be cheaper if many records are to be fetched!

(12)

Sorting

Main memory based algorithms

B bbl i k

à Bubblesort, quicksort, etc.

à Rely on frequent exchange of elements

à Not good for relations that don’t fit in memoryNot good for relations that don t fit in memory

Use index on sort key

à If one exists, simply access records in order, p y

à Even if one does not exist, we may build one, and then use the index to read the relation in sorted order

I d b ildi + i d i l

Index building cost + index sequential access cost à May be quite effective

In general external sort-merge is a good choice

(13)

External Sort-Merge

Let M be the memory (buffer) size in blocks 1 Create sorted runs

1. Create sorted runs.

Let i = 1

repeat until all blocks of relation have been read:

(a) Read next M blocks of relation into memory (a) Read next M blocks of relation into memory (b) Sort the in-memory blocks

(c) Write sorted data to run R_i; increment i.

Let the total number of runs be N

2. Merge the runs (N-way merge). (assume for now that N < M)

1. Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer pagep g

2. repeat

1. Select the first record (in sort order) among all buffer pages

2. Write the record to the output buffer. If the output buffer is full write it to disk.p p 3. Delete the record from its input buffer page.

If the buffer page becomes empty then

read the next block (if any) of the run into the buffer.

3 il ll i b ff

3. until all input buffer pages are empty:

(14)

External Sort-Merge (Cont.)

If N ≥ M, several merge passes are required.

I h i f M 1 d

à In each pass, contiguous groups of M - 1 runs are merged.

à A pass reduces the number of runs by a factor of M -1, and creates runs longer by the same factor. g y

E.g. If M=11, and there are 90 runs, one pass reduces the number of runs to 9 à Repeated passes are performed till all runs have been merged into one.

(15)

External Sort-Merge – Example 1

M=3 M 3

(16)

External Sort-Merge – Example 2

M=3

32 12 47 25 29 63 52 2 30 55 65 5 9 20 15 31 8 50 22 45 10 1 11 24 3 33 17 58 40 26 36 16 4

5 8 9 15 20 31 50 55 65

2 12 25 29 30 32 47 52 63 1 3 10 11 17 22 24 33 45 4 16 26 36 40 58

2 5 8 9 12 15 20 25 29 30 31 32 47 50 52 55 63 65 1 3 4 10 11 16 17 22 24 26 33 36 40 45 58

1 2 3 4 5 8 9 10 11 12 15 16 17 20 22 24 25 26 29 30 31 32 33 36 40 45 47 50 52 55 58 63 65

(17)

External Sort-Merge (Cont.)

Cost analysis:

T l b f i d ⎡l (b /M)⎤

à Total number of merge passes required: ⎡log_M–1(b_r/M)⎤.

à Disk accesses for initial run creation (as well as in each pass) is 2b_r

for final pass, we don’t count write cost p ,

we ignore final write cost for all operations since the output of an operation may be sent to the parent operation without being written to disk

Thus total number of disk accesses for external sorting:

2b_r+ 2b_r ⎡log_M–1(b_r /M)⎤ - b_r

⎡ / )⎤

= b_r( 2 ⎡log_M–1(b_r /M)⎤ + 1)

(18)

Join Operation

Several different algorithms to implement joins

N d l j i

à Nested-loop join

à Block nested-loop join à Indexed nested-loop joinIndexed nested loop join à Merge-join

à Hash-join

Choice based on cost estimate

Running Example

à Number of records of customer: 10,000 depositor: 5000 à Number of blocks of customer: 400 depositor: 100

(19)

Nested-Loop Join

To compute the theta join r ⋈

_θ

s

for each tuple t_r in r do for each tuple tp _s_s in s do

if pair (t_r, t_s) satisfy the join condition θ add t_r • t_s to the result.

à r is called the outer relation à s the inner relation

Requires no indices and can be used with any kind of join condition.

Expensive since it examines every pair of tuples in the two

relations.

(20)

Nested-Loop Join (Cont.)

Worst case

h i l h ld bl k f h l i

à there is memory only to hold one block of each relation

n

_r

∗ b

_s

+ b

_r

disk accesses.

Best case

à the smaller relation fits entirely in memory: use it as the inner relation b_r + b_s

Example

à Worst case

5,000 ∗ 400 + 100 = 2,000,100 (depositor as outer relation)

1,000 ∗ 100 + 400 = 1,000,400 (customer as the outer relation), , , ( ) à If smaller relation (depositor) fits entirely in memory

100 + 400 = 500

(21)

Block Nested-Loop Join

Variant of nested-loop join

bl k f i l i i i d i h bl k f l i

à every block of inner relation is paired with every block of outer relation

for each block B

_r

of r do

f h bl k B f d

for each block B

_s

of s do for each tuple t

_r

in B

_r

do

f h l i B d

for each tuple t

_s

in B

_s

do

if (t

_r

, t

_s

) satisfy the join condition

h dd h l

then add t

_r

• t

_s

to the result

Worst case estimate

à Each block in the inner relation s is read once for each block in the outer relation

à b_r_r ∗ b_s_s + b_r_r

(22)

Block Nested-Loop Join (Cont.)

Best case

à b

_r

+ b

_s

block accesses.

Improvements

à Use M-2 disk blocks as blocking unit for outer relations, and remaining two blocks to buffer inner relation and output

Cost =

⎡b

_r

/ (M-2)⎤

^∗

b

_s

+ b

_r

à If equi-join attribute forms a key of inner relation, stop inner q j y , p loop on first match

à Scan inner loop forward and backward alternately, to make use

of the blocks remaining in buffer (with LRU replacement)

(23)

Block Nested-Loop Join - Example

5 q 7 r

M=4

r s

5 q 7 r

A 3 B 10 C 8

5 q 7 r 6 s

6 s 1 t A 3

6 s 1 t C 8

6 s 1 t E 1

6 s 1 t G 8

6 s 1 t I 7 J 2

6 s 1 t K 4

6 s 1 t M 3

6 s 1 t O 9 C 8

D 6 E 1 F 9

6 s 1 t 6 u 1 v

B 10 D 6

D 6 s

F 9 D 6 s E 1 t

H 2 J 2

I 7 r

L 5 I 7 r L 5 q

N 6 N 6 s

P 1 N 6 s P 1 t G 8

H 2 I 7 J 2

2 w 3 x 4 y 8 z

6 u 1 v

6 u 1 v J 2

K 4 L 5 M 3

8 z

2 w 3 x A 3

2 w 3 x C 8

2 w 3 x E 1

2 w 3 x G 8

2 w 3 x I 7

2 w 3 x K 4

2 w 3 x M 3

2 w 3 x O 9 N 6

O 9 P 1

B 10 A 3 x

D 6 A 3 x D 6 u

F 9 E 1 v

H 2 E 1 v H 2 w

J 2 J 2 w

L 5 J 2 w

N 6 J 2 w M 3 x

P 1 N 6 u P 1 v

* Red blocks are flushed

(24)

Block Nested-Loop Join – Example (cont.)

4 y 4 y 4 y 4 y 4 y

8

4 y 4 y 4 y

M=4

r s

8 z 8 z 8 z 8 z 8 z

I 7

8 z 8 z 8 z

A 3 B 10 C 8

5 q 7 r

6 A 3

B 10

C 8 D 6 C 8 z

E 1 F 9 C 8 z

G 8 H 2 C 8 z G 8 z

I 7 J 2

K 4 L 5 K 4 y

M 3 N 6 K 4 y

O 9 P 1 K 4 y C 8

D 6 E 1 F 9

6 s 1 t 6 u

1 v G 8 z

* Red blocks are flushed

G 8 H 2 I 7 J 2

2 w 3 x 4 y 8

D 6 s I 7 r N 6 s A 3 x E 1 v J 2 w N 6 u C 8 z K 4 y

Final result

J 2 K 4 L 5 M 3

8 z

E 1 t L 5 q P 1 t D 6 u H 2 w M 3 x P 1 v G 8 z

y N 6

O 9 P 1

(25)

Hash-Join

Applicable for equi-joins and natural joins.

A hash function h is used to partition tuples of both relations

h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the

ib f d d i h l j i

common attributes of r and s used in the natural join.

à r₀, r₁, . . ., r_n : partitions of r

Each tuple tEach tuple t_r ∈∈ r is put in partition rr is put in partition r_i_i where i = h(t [JoinAttrs])where i h(t_r [JoinAttrs]).

à s₀, s₁, . . ., s_n : partitions of s

r tuples in r p

_i_i

need to be compared with s tuples in s p p

_i_i

only y

(Note: In the textbook, r_i is denoted as H_ri s_i is denoted as H_si and n is denoted

( , _i _ri, _i _si,

as n_h.)

(26)

Hash-Join - Example

(27)

Hash-Join Algorithm

1. Partition the relation s using hashing function h.

Wh i i i l i bl k f i d h

à When partitioning a relation, one block of memory is reserved as the output buffer for each partition.

2. Partition r similarly.

3. For each i:

(a) Load s_i into memory (a) Load s_i into memory

build an in-memory hash index on it using the join attribute (using a different hash function than the earlier one h)

(b) R d th t l i f th di k b

(b) Read the tuples in r_i from the disk one by one

For each tuple t_r locate each matching tuple t_s in s_i using the in-memory hash index. Output the concatenation of their attributes.

Relation s is called the build input and

r is called the probe input

(28)

Hash-Join algorithm (Cont.)

The value n and the hash function h is chosen such that each s

_i

sho ld fit in memor

should fit in memory.

Hash-table overflow occurs in partition s

_i

if s

_i

does not fit in memory

memory.

à Many tuples in s with same value for join attributes or bad hash function à Overflow resolution can be done in build phasep

Partition s_i is further partitioned using different hash function.

Partition r_i must be similarly partitioned.

F il i h l b f d li

à Fails with large numbers of duplicates

Fallback option: use block nested loops join on overflowed partitions

(29)

Cost of Hash-Join

Cost of hash join (without recursive partitioning) 3(b + b ) + 4n (4n can be ignored)

3(b_r b_s) 4n (4n can be ignored)

à If the entire build input can be kept in main memory, n can be set to 0 and the algorithm does not partition the relations into temporary files. Cost estimate goes down to b + b

down to b_r + b_s.

Example:

à customer ⋈ depositor

à memory size: 20 blocks; b_depositor= 100 and b_customer = 400.

à depositor is build input

à Partition it into 6 partitions each of size less than 20 blocks

à Partition it into 6 partitions, each of size less than 20 blocks

à Similarly, partition customer into 6 partitions each of size about 70

à Therefore total cost:

3(100 + 400) + 4*6 = 1,524 block transfers

à ignores cost of writing partially filled blocks

(30)

Evaluation of Expressions

So far, we have seen algorithms for individual operations

How do you evaluate an entire expression tree?

Materialization

à generate results of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat.

Pipelining

à pass on tuples to parent operations even as an operation is being executed

(31)

Materialization

Evaluate one operation at a time, starting at the lowest-level.

Use intermedi te res lts t ri liz d into t por ry r l tions to e l te ne t le el

Use intermediate results materialized into temporary relations to evaluate next-level operations.

E.g., in figure below

)

g g (

à compute and store

à then compute and store its join with customer

à and finally compute the projections on customer name )

2500(account

balance<

σ

à and finally compute the projections on customer-name.

(32)

Materialization (Cont.)

Materialized evaluation is always applicable

Cost of writing results to disk and reading them back can be quite high

à O r cost form las for operations ignore cost of writing res lts to disk à Our cost formulas for operations ignore cost of writing results to disk à Overall cost = Sum of costs of individual operations +

cost of writing intermediate results to disk

(33)

Pipelining

Evaluate several operations simultaneously

simultaneously

à passing the results of one operation on to the next.

E.g., in expression tree

à don’t store result of selection à instead, pass tuples directly to

the join

à Similarly don’t store result of à Similarly, don t store result of

join but pass tuples directly to projection

(34)

Pipelining (cont.)

Much cheaper than materialization

d l i di k

à no need to store a temporary relation to disk.

Pipelining may not always be possible

à e g sort hash join: m st wait for entire inp t to materialize à e.g., sort, hash-join: must wait for entire input to materialize à Very difficult to achieve a lengthy chain of pipeline

For pipelining to be effective For pipelining to be effective

à use evaluation algorithms that generate output tuples even as tuples are received for inputs to the operation

(35)

CHAPTER 13

CHAPTER 13

QUERY PROCESSING

Chapter 13: Query Processing

 Overview

 Measures of Query Cost

 Selection Operation

 Sorting

 Join Operation

 Other Operations

 Evaluation of Expressions

Basic Steps in Query Processing

1. Parsing and translation 2. Optimization

3. Evaluation

Basic Steps

 Parsing and translation

 Evaluation

 Evaluation

 More than one way to evaluate a query

sequential scan OR index scan on balance

sequential scan OR index scan on balance

Query Plan

Query Optimization

 Equivalence of Expressions

 Query optimization is the process of selecting the most efficient query

l ti pl f i

evaluation plan for a given query

Measures of Query Cost

 Cost is generally measured as total elapsed time for answering q er

query

 In a centralized system, disk access is the predominant cost

 For simplicity we use number of block transfers from/to disk as the cost measure

 Real systems take all of these measures into account

 Real systems take all of these measures into account

Selection Operation

 File scan

 A1: linear search

b

b

b b

Selection Operation (Cont.)

 A2: binary search

 A3: primary index on candidate key, equality A3: primary index on candidate key, equality

Selection Operation (Cont.)

 A4: primary index on nonkey, equality

 A5: equality on search-key of secondary index

Selections Involving Comparisons

 A6: primary index comparison

 A6: primary index, comparison

 A7: secondary index comparison

 A7: secondary index, comparison

Sorting

 Main memory based algorithms

 Use index on sort key

 In general external sort-merge is a good choice

 In general external sort-merge is a good choice

External Sort-Merge

External Sort-Merge (Cont.)

 If N ≥ M, several merge passes are required.

External Sort-Merge – Example 1

External Sort-Merge – Example 2

External Sort-Merge (Cont.)

 Cost analysis:

Join Operation

 Several different algorithms to implement joins

 Choice based on cost estimate

 Running Example

Nested-Loop Join

 To compute the theta join r ⋈

s

 Requires no indices and can be used with any kind of join condition.

 Expensive since it examines every pair of tuples in the two

relations.

Nested-Loop Join (Cont.)

 Worst case

n

∗ b

+ b

 Best case

 Example

Overview

Measures of Query Cost

Selection Operation

Sorting

Join Operation

Other Operations

Evaluation of Expressions

Parsing and translation

Evaluation

Evaluation

More than one way to evaluate a query

Equivalence of Expressions

Query optimization is the process of selecting the most efficient query

Cost is generally measured as total elapsed time for answering q er

In a centralized system, disk access is the predominant cost

For simplicity we use number of block transfers from/to disk as the cost measure

Real systems take all of these measures into account

Real systems take all of these measures into account

File scan

A1: linear search

A2: binary search

A3: primary index on candidate key, equality A3: primary index on candidate key, equality

A4: primary index on nonkey, equality

A5: equality on search-key of secondary index

A6: primary index comparison

A6: primary index, comparison

A7: secondary index comparison

A7: secondary index, comparison

Main memory based algorithms

Use index on sort key

In general external sort-merge is a good choice

In general external sort-merge is a good choice

If N ≥ M, several merge passes are required.

Cost analysis:

Several different algorithms to implement joins

Choice based on cost estimate

Running Example

To compute the theta join r ⋈

Requires no indices and can be used with any kind of join condition.

Expensive since it examines every pair of tuples in the two

Worst case

Best case

Example

Variant of nested-loop join

Worst case estimate

Best case

Improvements

Applicable for equi-joins and natural joins.

A hash function h is used to partition tuples of both relations

h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the

r tuples in r p

Relation s is called the build input and

The value n and the hash function h is chosen such that each s

Hash-table overflow occurs in partition s

So far, we have seen algorithms for individual operations

How do you evaluate an entire expression tree?

Materialization

Pipelining

Pipelining