As any idiot will tell you, the first step to finding a string W in text T, is to find the first byte of W within T. That much we know.
uint8_t *find_str(uint8_t *T, size_t lenT, uint8_t *W, size_t lenW) { uint8_t *F; ssize_t i; restart_the_whole_thing: F= memchr(T, W[0], lenT); if (!F) return NULL; lenT -= F-T; if (lenT<lenW) return NULL; T= F;Now, T[0]==W[0]. As long as T[i]==W[i], we're good. There is no way to skip this check, either.
for(i=0;i<lenW;i++) if (W[i]!=T[i]) { fail: // DANGER!DANGER! } return T; }What you do in the case of a failure separates high school string search from PhD string search. The high school search goes like:
fail: T++; lenT--; goto restart_the_whole_thing; // yeah, rightFor PhD search, there are several algorithms. The simplest-looking and most-sedgewick-like is Knuth-Morris-Pratt. There are a lot of people talking about this but I can't wrap my head around why the algorithm works.
Understanding it is crucial because I'm going to be using it in a context different than in-memory search. Specifically, I will use it to do search/replace in some huge files with some additional constraints.
All capital letter values below are strings or arrays. Indices start from 0. Negative indices are from the end of a string/array. i.e. W[-1] denotes the last byte of W. Wi denotes the first i bytes of W, i.e. W[0..i-1]. nW denotes the last n bytes. := is the assignment operator.
If W matches T, then it's always depicted at the end of T. I'm not interested in overlapping matches so the rest of the text isn't relevant. In the real implementation, T would be advanced by the length of the match and the search would restart.
Ti = Wi (1) T[i] ≠ W[i] (2)The trivial algorithm runs RESTART for all mismatches.
RESTART: T+= 1, i= 0 REPOS: if T[∃minimal(k)]=W[0], T+=k,i=0 else failHowever, we could analyse W and predetermine if T should be incremented even more. For instance,
i= 012345678 T= ababababc W= ababcHere, we would get a mismatch at i=4. Would it make sense to RESTART at this point? Of course not. We know that T[1]='b' and it won't match W[0]='a' anyway.
The main idea of the algorithm is to analyse W such that for each possible mismatch at i, T[i]≠W[i], determine how much T should be advanced and what value should be used for i from there on.
Now, let's think about how much T should be advanced. In order to advance T in a meaningful manner (i.e. not guaranteed to fail), we need to make sure that some bytes at the advanced position matches the first bytes of W. In the example above, advancing by 1 didn't work, but advancing by 2 would work:
T[2]=W[0] and T[3]=W[1] due to (1), T[2]=W[2] and T[3]=W[3] therefore, the above observation is really only about W: W[2]=W[0] and W[3]=W[1]Generalizing this strategy, we can say the following. If Wi contains a prefix of Wi somewhere other than position 0, then we can advance T to match that part of W. As a graphic:
R= r0 r1 .. rn Wi= r0 r1 .. rn ... Ti= r0 r1 .. rn ... also, Wi= ... r0 r1 .. rn ... ^--^ ^-----------^ | | | --- skip comparing these --- advance T by this amountIf we were to write it more formally:
Wi = R P = Q R S P,Q,R≠'' (3)
Before going further let's reason about S. If S is not empty, then W and T will look like this:
W = R0 R1 .. Rr P0 P1 .. Pp ... W = Q0 Q1 .. Qq R0 R1 .. Rr S0 S1 .. Ss ... T = Q0 Q1 .. Qq R0 R1 .. Rr S0 S1 .. Ss ...After doing our updates to T and i:
W = R0 R1 .. Rr P0 P1 .. Pp ... T = R0 R1 .. Rr S0 S1 .. Ss ... ^ | --- i now points hereAt this point, there are two possibilities:
S0 ≠ P0: For solutions ending up like this, the solution is no good because the update is guaranteed to cause a mismatch. Therefore, such solutions shall be discarded.
S0 = P0: For this case, there is another solution where
R' = R S0 P' = P[1..] Q' = Q S' = S[1..] Wi = R P = Q R S = R P[0] P[1..] = Q R S[0] S[1..] since P[0]=S[0] = R S[0] P[1..] = Q R S[0] S[1..] = R' P' = Q' R' S'This means that, if we find a solution to (3) where S≠'', we can extend R to the right until S becomes empty. If extension doesn't work, then that solution was useless anyway.
These two cases result in the following fact.
If prefix R repeats within Wi, it has to reach the end of Wi in order to be useful. In other words, it also has to be a suffix of Wi.Our partitioning has been simplified quite a bit now:
Wi = R P = Q R (4)Let's consider the case where there are at least two solutions to (4) with R values A and B. WLOG, let |A|>|B|.
Since A and B both match the initial part of Wi, B is a prefix of A. Last parts of A and B both match the last part of Wi, therefore B is also a suffix of A.
Since A is longer than B, it starts earlier as a suffix in Wi. If we chose B as our preferred R-value, then we would miss an earlier opportunity to match W.
Wi= A PA = QA A Wi= B PB = QB BIf we were to advance T by |QB|>|QA|, then we would skip over the initial part of A where it appears as a suffix. That is a legitimate potentially matching position because we know that T matches A. Therefore we mustn't use B. Doing this for all A,B pairs we realize that we must use only the longest R possible, for correctness.
Since R[i] is the longest prefix/suffix in Wi, it relates to the bytes W[0] thru W[i-1]. L[i]= |R[i]|. Let's look at some initial values of L and R.
If R[i-1]='', then there is no prefix to which we can attach Wi-1 and make a solution for i. The only option is to start a new prefix with the value wi-1 and that can only happen if wi-1=w0. Otherwise, we have no solution.
if R[i-1]='' if wi-1=w0 R[i]:= wi-1 L[i]:= 1 else R[i]:= '' L[i]:= 0 endif endifOn the other hand, if R[i-1]≠'', then we can try to construct R[i] by appending wi-1 to R[i-1]:
R[i]= R[i-1] wi-1We already know that R[i] is suffix of Wi because of the construction above and the fact that R[i-1] is a suffix of Wi-1. However, R[i] can be used only if it is also a prefix. So, wi-1 must match the first byte just after R within Wi. That position is given by |R[i-1]|=L[i-1].
if R[i-1]≠'' if wi-1=wL[i-1] R[i]:= R[i-1] wi-1 L[i]:= |R[i]| = L[i-1]+1 else it's complicated endif endifAfter handling all these cases, we have arrived at the most interesting part of the algorithm. To summarize, we have a non-empty R for i-1, but it can't be extended by adding wi-1 to it.
We know that Ri is not longer than Ri-1 since we tried to construct such an Ri and failed. If Ri exists, then it must be an extension of a proper prefix of Ri-1. Let me put it this way. If
Ri = Z wi-1then |Z|<|Ri-1| because we tried Z=Ri-1 and failed. Now, if |Z|<|Ri-1| then Z must be a proper prefix of Ri-1 since they both match the first part of W, that is:
Wi-1= Ri-1 Pi-1 Wi= Z wi-1 PiNow, let's write down our facts and see what we can do.
Ri-1= Z Y Ri= Z wi-1 Wi= Qi Ri Wi= Qi Z wi-1 (replaced value of Ri) Wi-1= Qi Z (if we remove the last byte from Wi, we get Wi-1) Wi-1= Qi-1 Ri-1 (partitioning for Wi-1) If we equate the last two expressions, Qi-1 Ri-1 = Qi ZThe last equation is a very important result. It tells us that Z is a suffix of Ri-1. Well, it was also a prefix of Ri-1. I wonder where I can find a substring Z of Ri-1, where Z is both a prefix and a suffix? Wait, I KNOW! We already calculated it. It's stored right there in
Z= R[ |Ri-1| ]If Z turns out to be the empty string, then we have no solution. Therefore we proceed with setting R[i]='' and so on.
Otherwise, now we use Z as a basis to construct R[i] by appending wi-1 to it. We know that the suffix condition holds because we just constructed it with the last byte. However, the prefix condition also has to hold:
Ri= Z wi-1 Wi= Ri Pi Wi= Z wi-1 Pi therefore, wi-1 must equal w|Z|Here, Z is guaranteed to be the best we can do because it was computed by the same algorithm.
The behaviour of the algorithm at this point is identical to the one we did before, while extending Ri-1 to get Ri. If the prefix condition fails, our only solution exists within the set of prefix/suffix substrings of Z and so on. Let's write down the whole code before we can simplify it:
if R[i-1]='' if wi-1=w0 R[i]:= wi-1 L[i]:= 1 else R[i]:= '' L[i]:= 0 endif endif if R[i-1]≠'' if wi-1=wL[i-1] R[i]:= R[i-1] wi-1 L[i]:= |R[i]| = L[i-1]+1 else it's complicated Z:= R[ |R[i-1]| ] z:= |Z|= L[ |R[i-1]| ]= L[ L[i-1] ] try_again: if Z≠'' if wi-1=wz R[i]:= Z wi-1 L[i]:= |R[i]|= |Z| + 1= z+1 else Z:= R[z] z:= |Z|= L[z] goto try_again endif else no solution, do it like at the beginning R[i]:= '' L[i]:= 0 endif endif endifNow that we're finished with R, we can totally remove it from the code and use only L.
if L[i-1]=0 if wi-1=w0 L[i]:= 1 else L[i]:= 0 endif endif if L[i-1]≠0 if wi-1=wL[i-1] L[i]:= L[i-1]+1 else it's complicated z:= L[ L[i-1] ] try_again: if z≠0 if wi-1=wz L[i]:= z+1 else z:= L[z] goto try_again endif else no solution, do it like at the beginning L[i]:= 0 endif endif endifSince the outmost conditions are mutually exclusive, we can re-order them and put them into a single if-else.
if L[i-1]≠0 if wi-1=wL[i-1] L[i]:= L[i-1]+1 else it's complicated z:= L[ L[i-1] ] try_again: if z≠0 if wi-1=wz L[i]:= z+1 else z:= L[z] goto try_again endif else no solution, do it like at the beginning L[i]:= 0 endif endif else if wi-1=w0 L[i]:= 1 else L[i]:= 0 endif endifNow we can combine the two places where an extension is done by assigning i-1 to z initially.
z:= i-1 while L[z]≠0 if wi-1=wL[z] L[i]:= L[z]+1 break while loop else z:= L[z] endif if L[z]=0 if wi-1=w0 L[i]:= 1 else L[i]:= 0 endifPrettifying it further:
z:= i-1 while L[z]≠0 and wi-1≠wL[z], z:= L[z] if L[z]=0 L[i]:= wi-1=w0 ? 1:0 else L[i]:= L[z]+1 endifIn the last part, we can change the if-condition to get rid of the conditional expression:
z:= i-1 while L[z]≠0 and wi-1≠wL[z], z:= L[z] if wi-1=wL[z] L[i]:= L[z] + 1 else L[i]:= 0 endif
When we advance T by |Qi|, we also need to decrement i by the same amount so that the comparison can continue beyond Ri.
Wi= Qi Ri |Wi|= i = |Qi| + |Ri| Q[i]= i - R[i], if R[i]≠0 Q[i]= 0 , if R[i]=0For R[i]=0, I will use the sentinel value Q[i]=0. So now, I guess we're doing this:
#include <stdio.h> #include <stdlib.h> #include <string.h> int *compute_Q(char *W) { int n,z,i,*L,*Q; n= strlen(W); L= calloc(n,sizeof(*L)); for(i=2;i<n;i++) { z= i-1; while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z]; if (W[i-1]==W[L[z]]) L[i]= L[z]+1; // calloc makes it zero otherwise } Q= L; // re-use the space for(i=0;i<n;i++) Q[i]= L[i]? i-L[i] : 0; return Q; } char *kmp(char *T, char *W,int *Q) { int i,n; char *eT; eT= T+strlen(T); i= 0; n= strlen(W); if (T+n>eT) return NULL; while(1) { if (T[i]==W[i]) { i++; if (i==n) return T; continue; } if (!i) { T= strchr(T+1, W[0]); if (!T) return NULL; } else if (Q[i]) { T+= Q[i]; i-= Q[i]; } else { T+= i; i= 0; } if (T+n>eT) return NULL; } } void kmptest(char *T, char *W) { char *oT=T; int *Q, n; Q= compute_Q(W); n= strlen(W); printf("T: %s\nW: %s\n",T,W); T-= n; while ((T=kmp(T+n,W,Q))) printf("found at %d\n", T-oT); free(Q); } struct { char *T, *W; } TT[]= { { "baabbbaabbaabbbabaabbbaabaabababba", "baababa" }, { "AAAABAAAAABBBAAAAB", "AAAB", }, { "hayhello", "hell" }, { NULL, NULL } }; int main() { int i; for(i=0;TT[i].T;i++) kmptest(TT[i].T,TT[i].W); return 0; }The above implementation isn't really O(|W|+|T|) but O(|T|2+|W|*|T|) because of the heavy usage of strlen. This can be fixed easily by passing n and eT around but this code is just a demo of the algorithm, not a finalized product.
As mentioned before, this code only finds non-overlapping matches to W in T. In order to find all matches, we extend W with a virtual character x such that x never compares equal to any other byte value. This way, a match is handled like a failure to match the extended W at the last byte (virtual character x). The match is reported and the usual updates to T and i are done to find the next (possibly overlapping) match.
I ended up not using this algorithm at all for my find/replace tool. In any case, I think this was a good exercise in understanding the KMP algorithm
#include <stdio.h> #include <stdlib.h> #include <string.h> int kmpall(char *T, char *W, int (*report)(void *, char *, char *, size_t),void* arg) { char *eT, *oT; int n,z,i,*L,*Q; n= strlen(W); if (!n) return 0; L= calloc(n+1,sizeof(*L)); if (!L) return 1; for(i=2;i<=n;i++) { z= i-1; while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z]; if (W[i-1]==W[L[z]]) L[i]= L[z]+1; } Q= L; for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0; eT= T+strlen(T); oT= T; i= 0; again: if (T+n>eT) goto done; for(;i<n;i++) if (T[i]!=W[i]) goto fail; // match if (report && (*report)(arg, oT, W, T-oT)) goto done; advance: if (Q[i]) { T+= Q[i]; i-= Q[i]; } else { T+= i; i=0; } goto again; fail: if (!i) { T++; goto again; } goto advance; done: free(Q); return 0; } int prt(void *arg, char *T, char *W, size_t pos) { printf("found at %d\n", (int) pos); return 0; } int main() { static struct { char *T, *W; } TT[]= { { "baabbbaabbaabbbabaabbbaabaabababba", "baababa" }, { "AAAABAAAAABBBAAAAB", "AAAB", }, { "hayhello", "hell" }, { NULL, NULL } }; int i; for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL); return 0; }Since we know that i!=0 when there is a match (we checked for empty W at the beginning), we can safely move code part fail before advance.
#include <stdio.h> #include <stdlib.h> #include <string.h> int kmpall(char *T, char *W, int (*report)(void *, char *, char *, size_t),void* arg) { char *eT, *oT; int n,z,i,*L,*Q; n= strlen(W); if (!n) return 0; L= calloc(n+1,sizeof(*L)); if (!L) return 1; for(i=2;i<=n;i++) { z= i-1; while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z]; if (W[i-1]==W[L[z]]) L[i]= L[z]+1; } Q= L; for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0; eT= T+strlen(T); oT= T; i= 0; again: if (T+n>eT) goto done; for(;i<n;i++) if (T[i]!=W[i]) goto fail; // match if (report && (*report)(arg, oT, W, T-oT)) goto done; fail: if (!i) { T++; goto again; } goto advance; advance: if (Q[i]) { T+= Q[i]; i-= Q[i]; } else { T+= i; i=0; } goto again; done: free(Q); return 0; } int prt(void *arg, char *T, char *W, size_t pos) { printf("found at %d\n", (int) pos); return 0; } int main() { static struct { char *T, *W; } TT[]= { { "baabbbaabbaabbbabaabbbaabaabababba", "baababa" }, { "AAAABAAAAABBBAAAAB", "AAAB", }, { "hayhello", "hell" }, { NULL, NULL } }; int i; for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL); return 0; }Now the whole thing can be made into a loop, removing the labels.
#include <stdio.h> #include <stdlib.h> #include <string.h> int kmpall(char *T, char *W, int (*report)(void *, char *, char *, size_t),void* arg) { char *eT, *oT; int n,z,i,*L,*Q; n= strlen(W); if (!n) return 0; L= calloc(n+1,sizeof(*L)); if (!L) return 1; for(i=2;i<=n;i++) { z= i-1; while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z]; if (W[i-1]==W[L[z]]) L[i]= L[z]+1; } Q= L; for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0; eT= T+strlen(T); oT= T; i= 0; while(T+n<=eT) { for(;i<n;i++) if (T[i]!=W[i]) break; if (i==n && report && (*report)(arg, oT, W, T-oT)) break; if (!i) { T++; } else if (Q[i]) { T+= Q[i]; i-= Q[i]; } else { T+= i; i=0; } } free(Q); return 0; } int prt(void *arg, char *T, char *W, size_t pos) { printf("found at %d\n", (int) pos); return 0; } int main() { static struct { char *T, *W; } TT[]= { { "baabbbaabbaabbbabaabbbaabaabababba", "baababa" }, { "AAAABAAAAABBBAAAAB", "AAAB", }, { "hayhello", "hell" }, { NULL, NULL } }; int i; for(i=0;TT[i].T;i++) { printf("T: %s\nW: %s\n",TT[i].T,TT[i].W); kmpall(TT[i].T,TT[i].W, prt, NULL); } return 0; }We could also make use of the fact that Q[0]==0 in order to simplify the if statements regarding i and Q[i] in the matching part:
#include <stdio.h> #include <stdlib.h> #include <string.h> int kmpall(char *T, char *W, int (*report)(void *, char *, char *, size_t),void* arg) { char *eT, *oT; int n,z,i,*L,*Q; n= strlen(W); if (!n) return 0; L= calloc(n+1,sizeof(*L)); if (!L) return 1; for(i=2;i<=n;i++) { z= i-1; while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z]; if (W[i-1]==W[L[z]]) L[i]= L[z]+1; } Q= L; for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0; eT= T+strlen(T); oT= T; i= 0; while(T+n<=eT) { for(;i<n;i++) if (T[i]!=W[i]) break; if (i==n && report && (*report)(arg, oT, W, T-oT)) break; if (Q[i]) { T+= Q[i]; i-= Q[i]; } else { T+= i ? i : 1; i=0; } } free(Q); return 0; } int prt(void *arg, char *T, char *W, size_t pos) { printf("found at %d\n", (int) pos); return 0; } int main() { static struct { char *T, *W; } TT[]= { { "baabbbaabbaabbbabaabbbaabaabababba", "baababa" }, { "AAAABAAAAABBBAAAAB", "AAAB", }, { "hayhello", "hell" }, { NULL, NULL } }; int i; for(i=0;TT[i].T;i++) { printf("T: %s\nW: %s\n",TT[i].T,TT[i].W); kmpall(TT[i].T,TT[i].W, prt, NULL); } return 0; }