S tr ing M at chi ng A lgor it hm s D ae -K i K ang
S tr in g M a tc hi ng
In p ut
A[1…n]:Text string P[1…m]: Pattern stringP[1…m]: Pattern string m<< nG oa l
Determine if A[1…n] includes P[1…m]N a ïv e M a tc hi ng o r S tu p id M a tc hi ng na iv eM a tc hi ng (A [ ], P [ ]) {
▷n: length of A[ ],m: length of P[ ]fo r i ← 1 to n -m + 1 { fo r i ← 1 to n -m + 1 { if (P[ 1 … m ] = A [i… i+ m -1 ]) th e n T h e re i s a m a tc h a t A [i] ; } } R u n n in g t im e : O (m n )
tporaostacyobobA[ ] raos raos
P[ ]1. 2.
H o w N a ïv e M a tc h in g W o rk s ? H o w N a ïv e M a tc h in g W o rk s ? H o w N a ïv e M a tc h in g W o rk s ? H o w N a ïv e M a tc h in g W o rk s ?
raos raos3. 12.
...abcdabcd... abcdabcwz
A[ ] P[ ]1.
mismatch
W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t? W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t? W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t? W h e n N a ïv e M a tc h in g i s i n e ff ic ie n t?
abcdabcwz abcdabcwzP[ ] abcdabcwz
abcdabcwz abcdabcwz
1. 2. 3. 4. 5.
Ra b in -K a rp A lg o ri th m
C ha ng e a p a tt e rn i nt o a n um b e r a nd c ha ng e a c o m p a re d p o rt io n o f a t ex t st ri ng i nt o a n um b e r
N ow , c o m p a re t w o n um b e rs ( in st e a d o f co m p a ri ng t w o s tr in g s)
S tr in g N um b e r
S tr in g N um b e r
Depends on the cardinality of a character set ∑ Example: ∑ = {a, b, c, d, e} |∑| = 5 Convert a, b, c, d, e into0, 1, 2, 3, 4 String“cad” will be2*52 +0*51 +3*50 = 28O v e rh e a d o f th e C o n v e rs io n O v e rh e a d o f th e C o n v e rs io n O v e rh e a d o f th e C o n v e rs io n O v e rh e a d o f th e C o n v e rs io n • C o n v e rs io n t im e o f A [i …i + m -1 ]
–a i = A[i+m-1] +d(A[i+m-2] +d(A[i+m-3] +d(… +d (A[i]))…) –Θ(m)time complexity –Total matching of A[1…n] takesΘ(mn)–Total matching of A[1…n] takesΘ(mn) –No better than naïve matching• H o rn e r’ s r u le R e g a rd le s s o f m , c a lc u la te t h e n u m b e r a s f o ll o w s :
–a i=d(a i-1–dm-1 A[i-1]) + A[i+m-1] –We calculate dm-1 once –Two multiplications and two additionsacebbceeaabceedb
eeaabp= 4*54 + 4*53 + 0*52 + 0*51 + 1 = 3001 a 1=0*54 +2*53 +4*52 +1*51 +1 = 356 acebbceeaabceedb
P[ ] A[ ]
O n e m a tc h in g e x a m p le O n e m a tc h in g e x a m p le O n e m a tc h in g e x a m p le O n e m a tc h in g e x a m p le
a2= 5(a1-0*54 )+2 = 1782acebbceeaabceedb a 3=5(a 2-2*54 )+4 = 2664
acebbceeaabceedb
a7=5(a6-2*54 )+1 = 3001
acebbceeaabceedb
B a s ic B a s ic B a s ic B a s ic R a b in K a rp A lg o ri th m R a b in K a rp A lg o ri th m R a b in K a rp A lg o ri th m R a b in K a rp A lg o ri th m
basicRabinKarp(A, P, d, q) { ▷n: length of A[ ], m: length of P[ ] p←0; a 1←0; forforforfori← 1totototom{▷calculatea 1 p← dp+ P[i];p← dp+ P[i]; a 1← da 1+ A[i]; } forforforfori← 1totototon-m+1{ ifififif(i≠ 1) thenthenthenthena i← d(a i-1–dm-1 A[i-1]) + A[i+m-1]; ifififif(p=a i) thenthenthenthenmatched at A[i]; } } Total running time: Total running time: ΘΘ((nn))P ro b le m o f b a s ic R a b in K a rp A lg o ri th m P ro b le m o f b a s ic R a b in K a rp A lg o ri th m P ro b le m o f b a s ic R a b in K a rp A lg o ri th m P ro b le m o f b a s ic R a b in K a rp A lg o ri th m • D e p e n d in g o n | Σ | a n d m , a
ic a n b e b ig
–bigger than CPU register size –and causes overflow• So lu ti o n
–Limit the size of a iusing modulo operation –a i=d(a i-1–dm-1 A[i-1]) + A[i+m-1] b i= (d(b i-1–(dm-1 modq)A[i-1]) + A[i+m-1]) modq –q is a big prime number, but dqshould be fit in a registeracebbceeaabceedb
eeaabp= (4*54 + 4*53 + 0*52 + 0*51 + 1) mod 113 = 63 a 1= (0*54 +2*53 +4*52 +1*51 +1) mod 113 = 17 acebbceeaabceedb
P[ ] A[ ]
M a tc h in g u s in g m o d u lo M a tc h in g u s in g m o d u lo M a tc h in g u s in g m o d u lo M a tc h in g u s in g m o d u lo
a 2= (5(a 1-0*(54 mod 113))+2) mod 113 = 87acebbceeaabceedb a 3= (5(a 2-2*(54 mod 113))+4) mod 113 = 65
acebbceeaabceedb
a 7= (5(a 6-2*(54 mod 113))+1) mod 113 = 63
acebbceeaabceedb
R a b in R a b in R a b in R a b in ---- K a rp A lg o ri th m K a rp A lg o ri th m K a rp A lg o ri th m K a rp A lg o ri th m
RabinKarp(A, P, d, q) { ▷n: size ofA[ ], m: size of P[ ] p←0; b 1←0; forforforfori← 1totototom{▷b 1 p← (dp+ P[i]) modq; b← (db+ A[i]) modq;b 1← (db 1+ A[i]) modq; } h← dm-1 modq; forforforfori← 1totototon-m+1{ ifififif(i≠ 1) thenthenthenthenb i← (d(b i-1–hA[i-1]) + A[i+m-1]) mod q; ifififif(p=b i) thenthenthenthen ifififif(P[1…m] = A[i…i+m-1])thenthenthenthen there is a match at A[i]; } } Average running time: Average running time: ΘΘ((nn))A ut o m a ta B a se d M a tc hi ng
Fi ni te A ut o m a ta
Fi ni te sym b ol s, s ta tes a nd t ra ns iti on
Fi ve tu p le : (Q , q 0 , A , ∑ , δ)
Q : states Q0 : start stateQ0 : start state A : accepted states ∑ : input alphabet δ: state transition diagramS ta te s re p re se nt a s na p sh o t o f m a tc hi ng p ro ce ss
A ut o m a ta t ha t ch e ck s a b a b a ca
aaaabbca a aa
0 1 2 3 4 5 6 7
aaaabb bc b
S : d v g a n b b a c ta b a b a a b a b a c a b a b a b a c a a g b k
0 1
abc 021
001
StateInput symbol 00 00 00
dez
Im p le m e n ta ti o n o f A u to m a ta Im p le m e n ta ti o n o f A u to m a ta Im p le m e n ta ti o n o f A u to m a ta Im p le m e n ta ti o n o f A u to m a ta
0 1abc 1000 1200
ElseStateInput symbol 21 3 4 5 6 7021
007
641
005
041
003
021 000000
0 000000
0 0000000
2
1 3 4 5 6 7
1200 3000 1400 5000 1460 7000 1200
A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta A lg o ri th m t h a t c h e c k s m a tc h in g u s in g a u to m a ta
FA-Matcher (A, δ, f ) ▷f: accepted state { ▷n: length of A[ ] q← 0;q← 0; forforforfori← 1totototon{ q← δ(q, A[i]); ifififif(q=f) thenthenthenthenwe found a match at A[i-m+1]; } } Total running time: Θ(n+ |∑|m)C o m p u te C o m p u te C o m p u te C o m p u te T ra n s it io n F u n c ti o n T ra n s it io n F u n c ti o n T ra n s it io n F u n c ti o n T ra n s it io n F u n c ti o n
Compute_Transition_Function(P, ∑) { m← |P|; ▷length of pattern for for for for q← 0totototomdodododo for for for for each character a ∈ ∑dodododo k← min(m+1,q+2); repeat repeat repeat repeat k← k -1repeat repeat repeat repeat k← k -1 until until until until P[1..k] is a suffix of P[1..q]a; δ(q, a) ← k; } Running time: Running time: ΘΘ((|∑|m))K M P (K nu th -M o rr is -P ra tt ) A lg o ri th m
Si m ila r to a ut om a ta b a sed m a tc hi ng
C om m on a lit ies
Prepare states for mismatch Simpler than automata based matchSimpler than automata based match ...abcdabcd... abcdabcwz abcdabcwzA[ ] P[ ]
abcdabcwz
12345678 π[8] = 4
9 P[ ]
P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h P re p a re r e tu rn p o in ts f o r e a c h m is m a tc h
Matched upto “abcdabc”, failed at “w” “abc” at 1,2,3 and “abc” at 5,6,7 are the same. 011112341112345678910 π[ ] abcdabcwz
So, we compare mismatched text char with P[4] For each index of pattern, we prepare the return point.