S tr in g M a tc hi ng

(1)

S tr ing M at chi ng A lgor it hm s D ae -K i K ang

(2)

S tr in g M a tc hi ng

In p ut

A[1…n]:Text string P[1…m]: Pattern stringP[1…m]: Pattern string m<< n

G oa l

Determine if A[1…n] includes P[1…m]

(3)

N a ïv e M a tc hi ng o r S tu p id M a tc hi ng na iv eM a tc hi ng (A [ ], P [ ]) {

▷n: length of A[ ],m: length of P[ ]

fo r i ← 1 to n -m + 1 { fo r i ← 1 to n -m + 1 { if (P[ 1 … m ] = A [i… i+ m -1 ]) th e n T h e re i s a m a tc h a t A [i] ; } } R u n n in g t im e : O (m n )

(4)

tporaostacyobobA[ ] raos raos

P[ ]1. 2.

H o w N a ïv e M a tc h in g W o rk s ? H o w N a ïv e M a tc h in g W o rk s ? H o w N a ïv e M a tc h in g W o rk s ? H o w N a ïv e M a tc h in g W o rk s ?

raos raos

3. 12.

(5)

...abcdabcd... abcdabcwz

A[ ] P[ ]1.

mismatch

W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t? W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t? W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t? W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t?

abcdabcwz abcdabcwz

P[ ] abcdabcwz

abcdabcwz abcdabcwz

1. 2. 3. 4. 5.

(6)

Ra b in -K a rp A lg o ri th m

C ha ng e a p a tt e rn i nt o a n um b e r a nd c ha ng e a c o m p a re d p o rt io n o f a t ex t st ri ng i nt o a n um b e r

N ow , c o m p a re t w o n um b e rs ( in st e a d o f co m p a ri ng t w o s tr in g s)

S tr in g N um b e r

Depends on the cardinality of a character set ∑ Example: ∑ = {a, b, c, d, e} |∑| = 5 Convert a, b, c, d, e into0, 1, 2, 3, 4 String“cad” will be2*52 +0*51 +3*50 = 28

(7)

O v e rh e a d o f th e C o n v e rs io n O v e rh e a d o f th e C o n v e rs io n O v e rh e a d o f th e C o n v e rs io n O v e rh e a d o f th e C o n v e rs io n • C o n v e rs io n t im e o f A [i …i + m -1 ]

–a i = A[i+m-1] +d(A[i+m-2] +d(A[i+m-3] +d(… +d (A[i]))…) –Θ(m)time complexity –Total matching of A[1…n] takesΘ(mn)–Total matching of A[1…n] takesΘ(mn) –No better than naïve matching

• H o rn e r’ s r u le R e g a rd le s s o f m , c a lc u la te t h e n u m b e r a s f o ll o w s :

–a i=d(a i-1–dm-1 A[i-1]) + A[i+m-1] –We calculate dm-1 once –Two multiplications and two additions

(8)

acebbceeaabceedb

eeaabp= 4*54 + 4*53 + 0*52 + 0*51 + 1 = 3001 a 1=0*54 +2*53 +4*52 +1*51 +1 = 356 acebbceeaabceedb

P[ ] A[ ]

O n e m a tc h in g e x a m p le O n e m a tc h in g e x a m p le O n e m a tc h in g e x a m p le O n e m a tc h in g e x a m p le

a2= 5(a1-0*54 )+2 = 1782

acebbceeaabceedb a 3=5(a 2-2*54 )+4 = 2664

acebbceeaabceedb

a7=5(a6-2*54 )+1 = 3001

acebbceeaabceedb

(9)

B a s ic B a s ic B a s ic B a s ic R a b in K a rp A lg o ri th m R a b in K a rp A lg o ri th m R a b in K a rp A lg o ri th m R a b in K a rp A lg o ri th m

basicRabinKarp(A, P, d, q) { ▷n: length of A[ ], m: length of P[ ] p←0; a 1←0; forforforfori← 1totototom{▷calculatea 1 p← dp+ P[i];p← dp+ P[i]; a 1← da 1+ A[i]; } forforforfori← 1totototon-m+1{ ifififif(i≠ 1) thenthenthenthena i← d(a i-1–dm-1 A[i-1]) + A[i+m-1]; ifififif(p=a i) thenthenthenthenmatched at A[i]; } } Total running time: Total running time: ΘΘ((nn))

(10)

P ro b le m o f b a s ic R a b in K a rp A lg o ri th m P ro b le m o f b a s ic R a b in K a rp A lg o ri th m P ro b le m o f b a s ic R a b in K a rp A lg o ri th m P ro b le m o f b a s ic R a b in K a rp A lg o ri th m • D e p e n d in g o n | Σ | a n d m , a

i

c a n b e b ig

–bigger than CPU register size –and causes overflow

• So lu ti o n

–Limit the size of a iusing modulo operation –a i=d(a i-1–dm-1 A[i-1]) + A[i+m-1] b i= (d(b i-1–(dm-1 modq)A[i-1]) + A[i+m-1]) modq –q is a big prime number, but dqshould be fit in a register

(11)

acebbceeaabceedb

eeaabp= (4*54 + 4*53 + 0*52 + 0*51 + 1) mod 113 = 63 a 1= (0*54 +2*53 +4*52 +1*51 +1) mod 113 = 17 acebbceeaabceedb

P[ ] A[ ]

M a tc h in g u s in g m o d u lo M a tc h in g u s in g m o d u lo M a tc h in g u s in g m o d u lo M a tc h in g u s in g m o d u lo

a 2= (5(a 1-0*(54 mod 113))+2) mod 113 = 87

acebbceeaabceedb a 3= (5(a 2-2*(54 mod 113))+4) mod 113 = 65

acebbceeaabceedb

a 7= (5(a 6-2*(54 mod 113))+1) mod 113 = 63

acebbceeaabceedb

(12)

R a b in R a b in R a b in R a b in ---- K a rp A lg o ri th m K a rp A lg o ri th m K a rp A lg o ri th m K a rp A lg o ri th m

RabinKarp(A, P, d, q) { ▷n: size ofA[ ], m: size of P[ ] p←0; b 1←0; forforforfori← 1totototom{▷b 1 p← (dp+ P[i]) modq; b← (db+ A[i]) modq;b 1← (db 1+ A[i]) modq; } h← dm-1 modq; forforforfori← 1totototon-m+1{ ifififif(i≠ 1) thenthenthenthenb i← (d(b i-1–hA[i-1]) + A[i+m-1]) mod q; ifififif(p=b i) thenthenthenthen ifififif(P[1…m] = A[i…i+m-1])thenthenthenthen there is a match at A[i]; } } Average running time: Average running time: ΘΘ((nn))

(13)

A ut o m a ta B a se d M a tc hi ng

Fi ni te A ut o m a ta

Fi ni te sym b ol s, s ta tes a nd t ra ns iti on

Fi ve tu p le : (Q , q 0 , A , ∑ , δ)

Q : states Q0 : start stateQ0 : start state A : accepted states ∑ : input alphabet δ: state transition diagram

S ta te s re p re se nt a s na p sh o t o f m a tc hi ng p ro ce ss

(14)

A ut o m a ta t ha t ch e ck s a b a b a ca

aaaabbc

a a aa

0 1 2 3 4 5 6 7

aaaabb b

c b

S : d v g a n b b a c ta b a b a a b a b a c a b a b a b a c a a g b k

(15)

0 1

abc 021

001

StateInput symbol 00 00 00

dez

Im p le m e n ta ti o n o f A u to m a ta Im p le m e n ta ti o n o f A u to m a ta Im p le m e n ta ti o n o f A u to m a ta Im p le m e n ta ti o n o f A u to m a ta

0 1

abc 1000 1200

ElseStateInput symbol 21 3 4 5 6 7021

007

641

005

041

003

021 000000

0 000000

0 0000000

2

1 3 4 5 6 7

1200 3000 1400 5000 1460 7000 1200

(16)

A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta

FA-Matcher (A, δ, f ) ▷f: accepted state { ▷n: length of A[ ] q← 0;q← 0; forforforfori← 1totototon{ q← δ(q, A[i]); ifififif(q=f) thenthenthenthenwe found a match at A[i-m+1]; } } Total running time: Θ(n+ |∑|m)

(17)

C o m p u te C o m p u te C o m p u te C o m p u te T ra n s it io n F u n c ti o n T ra n s it io n F u n c ti o n T ra n s it io n F u n c ti o n T ra n s it io n F u n c ti o n

Compute_Transition_Function(P, ∑) { m← |P|; ▷length of pattern for for for for q← 0totototomdodododo for for for for each character a ∈ ∑dodododo k← min(m+1,q+2); repeat repeat repeat repeat k← k -1repeat repeat repeat repeat k← k -1 until until until until P[1..k] is a suffix of P[1..q]a; δ(q, a) ← k; } Running time: Running time: ΘΘ((|∑|m))

(18)

K M P (K nu th -M o rr is -P ra tt ) A lg o ri th m

Si m ila r to a ut om a ta b a sed m a tc hi ng

C om m on a lit ies

Prepare states for mismatch Simpler than automata based matchSimpler than automata based match ...abcdabcd... abcdabcwz abcdabcwz

A[ ] P[ ]

(19)

abcdabcwz

12345678 π[8] = 4

9 P[ ]

P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h

Matched upto “abcdabc”, failed at “w” “abc” at 1,2,3 and “abc” at 5,6,7 are the same. 0111123411

12345678910 π[ ] abcdabcwz

So, we compare mismatched text char with P[4] For each index of pattern, we prepare the return point.

(20)

K M P A lg o ri th m K M P A lg o ri th m K M P A lg o ri th m K M P A lg o ri th m

KMP(A[ ], P[ ]) { preprocessing(P); i← 1;▷index pointer of text j← 1;▷index pointer of pattern ▷n: size of A[ ], m: size of P[ ] whilewhilewhilewhile(i≤n) {whilewhilewhilewhile(i≤n) { ifififif(j= 0ororororA[i] = P[j]) thenthenthenthen{ i++; j++; } elseelseelseelsej← π[j]; ifififif(j=m+1) thenthenthenthen{ There is a match at A[i-m]; j← π[j]; } } } Running time: Running time: ΘΘ((nn))

(21)

P re p ro c e s s in g P re p ro c e s s in g P re p ro c e s s in g P re p ro c e s s in g

preprocessing(P) { m← |P|; ▷length of pattern π[1] ← 0; k← 0; for for for for q← 2totototomdodododo whilewhilewhilewhile(k > 0) andandandand(P[k+1] ≠P[q]) dowhilewhilewhilewhile(k > 0) andandandand(P[k+1] ≠P[q]) do k← π[k]; ifififif(P[k+1] = P[q])thenthenthenthenk ←k+1; π[q]← k; returnreturnreturnreturnπ; } Running time: Running time: ΘΘ((mm))