Intro to DB
CHAPTER 13
QUERY PROCESSING
Chapter 13: Query Processing
Overview
Measures of Query Cost
Selection Operation
Sorting
Join Operation
Other Operations
Evaluation of Expressions
Basic Steps in Query Processing
1. Parsing and translation 2. Optimization
3. Evaluation
Basic Steps
Parsing and translation
l h i i l f ( l i l l b )
à translate the query into an internal form (eg., relational algebra) à Parser checks syntax, verifies relations
Evaluation
Evaluation
à The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
More than one way to evaluate a query
à equivalence of expressions:
SELECT balance FROM account WHERE balance>2500
Πbalance(σbalance>2500(account)) OR σbalance>2500(Πbalance(account))
à Different strategies and algorithms
σ (account) :
sequential scan OR index scan on balance
σ (account) :
sequential scan OR index scan on balance
Query Plan
Evaluation primitive
à a relational algebra expression annotated with instructions on how to evaluate ita relational algebra expression annotated with instructions on how to evaluate it
σbalance>2500(account): use index 1
σbalance>2500(account): use table scan
Query evaluation (execution) plan
à a sequence of primitive operations that can be used to evaluate a query
Example:
Example:
à SELECT balance FROM account WHERE balance>2500
Π balance σbalance>2500: seq. scan
Π
σ balance>2500: use index 1 Π balance
account account
Query Optimization
Equivalence of Expressions
Gi DB h S Q S i i l h Q’
Given a DB schema S, a query Q on S is equivalent to another query Q’ on S, if the answer sets of Q and Q’ are the same in any instances of the DB.
Π (σ (customer ⋈ depositor ⋈ branch)) vs Πb_name, asset(σc_city=“PC”(customer ⋈ depositor ⋈ branch)) vs
Πb_name, asset((σc_city=“PC”(customer)) ⋈ depositor ⋈ branch)
Query optimization is the process of selecting the most efficient query
l ti pl f i
evaluation plan for a given query
à Most efficient? How do we know?
Measures of Query Cost
Cost is generally measured as total elapsed time for answering q er
query
à Many factors contribute to time cost
disk accesses, CPU, or even network communication, ,
In a centralized system, disk access is the predominant cost
à average-seek-cost, average-block-read-cost, average-block-write-costg g g
For simplicity we use number of block transfers from/to disk as the cost measure
à ignore the difference in cost between sequential and random I/O à Ignore the difference in cost between disk reads and disk writes
i CPU
à ignore CPU costs
à ignore writing final output to disk
Real systems take all of these measures into account
Real systems take all of these measures into account
Selection Operation
File scan
à locate and retrieve records that fulfill a selection condition à locate and retrieve records that fulfill a selection condition à Full file scan: retrieve all records of a file (relation)
A1: linear search
à Scan each file block and test all records on the selection condition
b
à Cost estimate (number of disk blocks scanned) =
b
r # of blocks containing records from relation r à If selection is on a key attribute, cost = (If selection is on a key attribute, cost (
b b
r /2)/2) stop on finding record
à Linear search can be applied regardless of
l i di i
selection condition or
ordering of records in the file, or
availability of indices
Selection Operation (Cont.)
A2: binary search
A li bl if l i i li i h ib hi h
à Applicable if selection is an equality comparison on the attribute on which file is ordered
à Assume that the blocks of a relation are stored contiguously g y à Cost estimate (number of disk blocks to be scanned):
⎡log2(br)⎤ — cost of locating the first tuple by a binary search on the blocks
Pl b f bl k i i d h i f l i di i
Plus number of blocks containing records that satisfy selection condition
A3: primary index on candidate key, equality A3: primary index on candidate key, equality
à Retrieve a single record that satisfies the corresponding equality condition à Cost = HTi + 1
Selection Operation (Cont.)
A4: primary index on nonkey, equality
R i l i l d
à Retrieve multiple records
à Records will be on consecutive blocks
à Cost = HTCost HTii + number of blocks containing retrieved records number of blocks containing retrieved records
A5: equality on search-key of secondary index
à Retrieve a single record if the search-key is a candidate key
Cost = HTi + 1
R i l i l d if h k i did k
à Retrieve multiple records if search-key is not a candidate key
Cost = HTi + number of records retrieved
Can be very expensive!y p
each record may be on a different block
one block access for each retrieved record
Selections Involving Comparisons
Can implement selections of the form σA≤V (r) or σA ≥ V(r) by using
à a linear file scan or binary search,y ,
à or by using indices in the following ways:
A6: primary index comparison
A6: primary index, comparison
(Relation is sorted on A)
For σA ≥V(r) use index to find first tuple ≥ v and scan relation sequentially from there
from there
For σA≤V (r) just scan relation sequentially till first tuple > v; do not use index
A7: secondary index comparison
A7: secondary index, comparison
For σA ≥V(r) use index to find first index entry ≥ v and scan index sequentially from there, to find pointers to records.
For σA≤VA≤V (r) just scan leaf pages of index finding pointers to records, till first ( ) j p g d d g p d , entry > v
In either case,
Retrieving records that are pointed to requires an I/O for each record
Linear file scan may be cheaper if many records are to be fetched!
Sorting
Main memory based algorithms
B bbl i k
à Bubblesort, quicksort, etc.
à Rely on frequent exchange of elements
à Not good for relations that don’t fit in memoryNot good for relations that don t fit in memory
Use index on sort key
à If one exists, simply access records in order, p y
à Even if one does not exist, we may build one, and then use the index to read the relation in sorted order
I d b ildi + i d i l
Index building cost + index sequential access cost à May be quite effective
In general external sort-merge is a good choice
In general external sort-merge is a good choice
External Sort-Merge
Let M be the memory (buffer) size in blocks 1 Create sorted runs
1. Create sorted runs.
Let i = 1
repeat until all blocks of relation have been read:
(a) Read next M blocks of relation into memory (a) Read next M blocks of relation into memory (b) Sort the in-memory blocks
(c) Write sorted data to run Ri; increment i.
Let the total number of runs be N
2. Merge the runs (N-way merge). (assume for now that N < M)
1. Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer pagep g
2. repeat
1. Select the first record (in sort order) among all buffer pages
2. Write the record to the output buffer. If the output buffer is full write it to disk.p p 3. Delete the record from its input buffer page.
If the buffer page becomes empty then
read the next block (if any) of the run into the buffer.
3 il ll i b ff
3. until all input buffer pages are empty:
External Sort-Merge (Cont.)
If N ≥ M, several merge passes are required.
I h i f M 1 d
à In each pass, contiguous groups of M - 1 runs are merged.
à A pass reduces the number of runs by a factor of M -1, and creates runs longer by the same factor. g y
E.g. If M=11, and there are 90 runs, one pass reduces the number of runs to 9 à Repeated passes are performed till all runs have been merged into one.
External Sort-Merge – Example 1
M=3 M 3
External Sort-Merge – Example 2
M=3
32 12 47 25 29 63 52 2 30 55 65 5 9 20 15 31 8 50 22 45 10 1 11 24 3 33 17 58 40 26 36 16 4
5 8 9 15 20 31 50 55 65
2 12 25 29 30 32 47 52 63 1 3 10 11 17 22 24 33 45 4 16 26 36 40 58
2 5 8 9 12 15 20 25 29 30 31 32 47 50 52 55 63 65 1 3 4 10 11 16 17 22 24 26 33 36 40 45 58
1 2 3 4 5 8 9 10 11 12 15 16 17 20 22 24 25 26 29 30 31 32 33 36 40 45 47 50 52 55 58 63 65
External Sort-Merge (Cont.)
Cost analysis:
T l b f i d ⎡l (b /M)⎤
à Total number of merge passes required: ⎡logM–1(br/M)⎤.
à Disk accesses for initial run creation (as well as in each pass) is 2br
for final pass, we don’t count write cost p ,
we ignore final write cost for all operations since the output of an operation may be sent to the parent operation without being written to disk
Thus total number of disk accesses for external sorting:
2br + 2br ⎡logM–1(br /M)⎤ - br
⎡ / )⎤
= br( 2 ⎡logM–1(br /M)⎤ + 1)
Join Operation
Several different algorithms to implement joins
N d l j i
à Nested-loop join
à Block nested-loop join à Indexed nested-loop joinIndexed nested loop join à Merge-join
à Hash-join
Choice based on cost estimate
Running Example
à Number of records of customer: 10,000 depositor: 5000 à Number of blocks of customer: 400 depositor: 100
Nested-Loop Join
To compute the theta join r ⋈
θs
for each tuple tr in r do for each tuple tp ss in s do
if pair (tr, ts) satisfy the join condition θ add tr • ts to the result.
à r is called the outer relation à s the inner relation
Requires no indices and can be used with any kind of join condition.
Expensive since it examines every pair of tuples in the two
relations.
Nested-Loop Join (Cont.)
Worst case
h i l h ld bl k f h l i
à there is memory only to hold one block of each relation
n
r∗ b
s+ b
rdisk accesses.
Best case
à the smaller relation fits entirely in memory: use it as the inner relation br + bs
Example
à Worst case
5,000 ∗ 400 + 100 = 2,000,100 (depositor as outer relation)
1,000 ∗ 100 + 400 = 1,000,400 (customer as the outer relation), , , ( ) à If smaller relation (depositor) fits entirely in memory
100 + 400 = 500
Block Nested-Loop Join
Variant of nested-loop join
bl k f i l i i i d i h bl k f l i
à every block of inner relation is paired with every block of outer relation
for each block B
rof r do
f h bl k B f d
for each block B
sof s do for each tuple t
rin B
rdo
f h l i B d
for each tuple t
sin B
sdo
if (t
r, t
s) satisfy the join condition
h dd h l
then add t
r• t
sto the result
Worst case estimate
à Each block in the inner relation s is read once for each block in the outer relation
à brr ∗ bss + brr
Block Nested-Loop Join (Cont.)
Best case
à b
r+ b
sblock accesses.
Improvements
à Use M-2 disk blocks as blocking unit for outer relations, and remaining two blocks to buffer inner relation and output
Cost =
⎡b
r/ (M-2)⎤
∗b
s+ b
rà If equi-join attribute forms a key of inner relation, stop inner q j y , p loop on first match
à Scan inner loop forward and backward alternately, to make use
of the blocks remaining in buffer (with LRU replacement)
Block Nested-Loop Join - Example
5 q 7 r
M=4
r s
5 q 7 r
5 q 7 r
5 q 7 r
5 q 7 r
5 q 7 r
5 q 7 r
5 q 7 r
A 3 B 10 C 8
5 q 7 r 6 s
6 s 1 t A 3
6 s 1 t C 8
6 s 1 t E 1
6 s 1 t G 8
6 s 1 t I 7 J 2
6 s 1 t K 4
6 s 1 t M 3
6 s 1 t O 9 C 8
D 6 E 1 F 9
6 s 1 t 6 u 1 v
B 10 D 6
D 6 s
F 9 D 6 s E 1 t
H 2 J 2
I 7 r
L 5 I 7 r L 5 q
N 6 N 6 s
P 1 N 6 s P 1 t G 8
H 2 I 7 J 2
2 w 3 x 4 y 8 z
6 u 1 v
6 u 1 v
6 u 1 v
6 u 1 v
6 u 1 v
6 u 1 v
6 u 1 v
6 u 1 v J 2
K 4 L 5 M 3
8 z
2 w 3 x A 3
2 w 3 x C 8
2 w 3 x E 1
2 w 3 x G 8
2 w 3 x I 7
2 w 3 x K 4
2 w 3 x M 3
2 w 3 x O 9 N 6
O 9 P 1
B 10 A 3 x
D 6 A 3 x D 6 u
F 9 E 1 v
H 2 E 1 v H 2 w
J 2 J 2 w
L 5 J 2 w
N 6 J 2 w M 3 x
P 1 N 6 u P 1 v
* Red blocks are flushed
Block Nested-Loop Join – Example (cont.)
4 y 4 y 4 y 4 y 4 y
8
4 y 4 y 4 y
M=4
r s
8 z 8 z 8 z 8 z 8 z
I 7
8 z 8 z 8 z
A 3 B 10 C 8
5 q 7 r
6 A 3
B 10
C 8 D 6 C 8 z
E 1 F 9 C 8 z
G 8 H 2 C 8 z G 8 z
I 7 J 2
K 4 L 5 K 4 y
M 3 N 6 K 4 y
O 9 P 1 K 4 y C 8
D 6 E 1 F 9
6 s 1 t 6 u
1 v G 8 z
* Red blocks are flushed
G 8 H 2 I 7 J 2
2 w 3 x 4 y 8
D 6 s I 7 r N 6 s A 3 x E 1 v J 2 w N 6 u C 8 z K 4 y
Final result
J 2 K 4 L 5 M 3
8 z
E 1 t L 5 q P 1 t D 6 u H 2 w M 3 x P 1 v G 8 z
y N 6
O 9 P 1
Hash-Join
Applicable for equi-joins and natural joins.
A hash function h is used to partition tuples of both relations
h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the
ib f d d i h l j i
common attributes of r and s used in the natural join.
à r0, r1, . . ., rn : partitions of r
Each tuple tEach tuple tr ∈∈ r is put in partition rr is put in partition rii where i = h(t [JoinAttrs])where i h(tr [JoinAttrs]).
à s0, s1, . . ., sn : partitions of s
r tuples in r p
iineed to be compared with s tuples in s p p
iionly y
(Note: In the textbook, ri is denoted as Hri si is denoted as Hsi and n is denoted
( , i ri, i si,
as nh. )
Hash-Join - Example
Hash-Join Algorithm
1. Partition the relation s using hashing function h.
Wh i i i l i bl k f i d h
à When partitioning a relation, one block of memory is reserved as the output buffer for each partition.
2. Partition r similarly.
2. Partition r similarly.
3. For each i:
(a) Load si into memory (a) Load si into memory
build an in-memory hash index on it using the join attribute (using a different hash function than the earlier one h)
(b) R d th t l i f th di k b
(b) Read the tuples in ri from the disk one by one
For each tuple tr locate each matching tuple ts in si using the in-memory hash index. Output the concatenation of their attributes.
Relation s is called the build input and
r is called the probe input
Hash-Join algorithm (Cont.)
The value n and the hash function h is chosen such that each s
isho ld fit in memor
should fit in memory.
Hash-table overflow occurs in partition s
iif s
idoes not fit in memory
memory.
à Many tuples in s with same value for join attributes or bad hash function à Overflow resolution can be done in build phasep
Partition si is further partitioned using different hash function.
Partition ri must be similarly partitioned.
F il i h l b f d li
à Fails with large numbers of duplicates
Fallback option: use block nested loops join on overflowed partitions
Cost of Hash-Join
Cost of hash join (without recursive partitioning) 3(b + b ) + 4n (4n can be ignored)
3(br bs) 4n (4n can be ignored)
à If the entire build input can be kept in main memory, n can be set to 0 and the algorithm does not partition the relations into temporary files. Cost estimate goes down to b + b
down to br + bs.
Example:
à customer ⋈ depositor
à memory size: 20 blocks; bdepositor= 100 and bcustomer = 400.
à depositor is build input
à Partition it into 6 partitions each of size less than 20 blocks
à Partition it into 6 partitions, each of size less than 20 blocks
à Similarly, partition customer into 6 partitions each of size about 70
à Therefore total cost:
3(100 + 400) + 4*6 = 1,524 block transfers
à ignores cost of writing partially filled blocks
Evaluation of Expressions
So far, we have seen algorithms for individual operations
How do you evaluate an entire expression tree?
Materialization
à generate results of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat.
Pipelining
Pipelining
à pass on tuples to parent operations even as an operation is being executed
Materialization
Evaluate one operation at a time, starting at the lowest-level.
Use intermedi te res lts t ri liz d into t por ry r l tions to e l te ne t le el
Use intermediate results materialized into temporary relations to evaluate next-level operations.
E.g., in figure below
)
g g (
à compute and store
à then compute and store its join with customer
à and finally compute the projections on customer name )
2500(account
balance<
σ
à and finally compute the projections on customer-name.
Materialization (Cont.)
Materialized evaluation is always applicable
Cost of writing results to disk and reading them back can be quite high
à O r cost form las for operations ignore cost of writing res lts to disk à Our cost formulas for operations ignore cost of writing results to disk à Overall cost = Sum of costs of individual operations +
cost of writing intermediate results to disk
Pipelining
Evaluate several operations simultaneously
simultaneously
à passing the results of one operation on to the next.
E.g., in expression tree
à don’t store result of selection à instead, pass tuples directly to
the join
à Similarly don’t store result of à Similarly, don t store result of
join but pass tuples directly to projection
Pipelining (cont.)
Much cheaper than materialization
d l i di k
à no need to store a temporary relation to disk.
Pipelining may not always be possible
à e g sort hash join: m st wait for entire inp t to materialize à e.g., sort, hash-join: must wait for entire input to materialize à Very difficult to achieve a lengthy chain of pipeline
For pipelining to be effective For pipelining to be effective
à use evaluation algorithms that generate output tuples even as tuples are received for inputs to the operation