# String Search

Well this is another case of "I was doing something else..". For small strings, strstr is enough but if you need to find a string in a file with a size of hundreds of megabytes, then you need some specialized tool, eh?

As any idiot will tell you, the first step to finding a string W in text T, is to find the first byte of W within T. That much we know.

```uint8_t *find_str(uint8_t *T, size_t lenT, uint8_t *W, size_t lenW)
{
uint8_t *F;
ssize_t i;
restart_the_whole_thing:
F= memchr(T, W[0], lenT);
if (!F) return NULL;
lenT -= F-T;
if (lenT<lenW) return NULL;
T= F;
```
Now, T[0]==W[0]. As long as T[i]==W[i], we're good. There is no way to skip this check, either.
```  for(i=0;i<lenW;i++)
if (W[i]!=T[i])
{
fail:
// DANGER!DANGER!
}
return T;
}
```
What you do in the case of a failure separates high school string search from PhD string search. The high school search goes like:
```fail:
T++; lenT--; goto restart_the_whole_thing; // yeah, right
```
For PhD search, there are several algorithms. The simplest-looking and most-sedgewick-like is Knuth-Morris-Pratt. There are a lot of people talking about this but I can't wrap my head around why the algorithm works.

Understanding it is crucial because I'm going to be using it in a context different than in-memory search. Specifically, I will use it to do search/replace in some huge files with some additional constraints.

### Notation

In this page, I assume that T has been advanced so that T[0]==W[0]. Therefore, the algorithm I describe is a string matching algorithm after this increment. In my context, advancing T forward corresponds to copying some bytes from the source file to the output file.

All capital letter values below are strings or arrays. Indices start from 0. Negative indices are from the end of a string/array. i.e. W[-1] denotes the last byte of W. Wi denotes the first i bytes of W, i.e. W[0..i-1]. nW denotes the last n bytes. := is the assignment operator.

If W matches T, then it's always depicted at the end of T. I'm not interested in overlapping matches so the rest of the text isn't relevant. In the real implementation, T would be advanced by the length of the match and the search would restart.

## Initial Thoughts

Let's assume that we have matched T[0..i-1] to W[0..i-1] so far, and that we have a mismatch at i.
```    Ti = Wi          (1)
T[i] ≠ W[i]        (2)
```
The trivial algorithm runs RESTART for all mismatches.
```  RESTART: T+= 1, i= 0
REPOS:   if T[∃minimal(k)]=W[0], T+=k,i=0 else fail
```
However, we could analyse W and predetermine if T should be incremented even more. For instance,
``` i= 012345678
T= ababababc
W= ababc
```
Here, we would get a mismatch at i=4. Would it make sense to RESTART at this point? Of course not. We know that T[1]='b' and it won't match W[0]='a' anyway.

The main idea of the algorithm is to analyse W such that for each possible mismatch at i, T[i]≠W[i], determine how much T should be advanced and what value should be used for i from there on.

Now, let's think about how much T should be advanced. In order to advance T in a meaningful manner (i.e. not guaranteed to fail), we need to make sure that some bytes at the advanced position matches the first bytes of W. In the example above, advancing by 1 didn't work, but advancing by 2 would work:

```   T[2]=W[0] and T[3]=W[1]

due to (1),
T[2]=W[2] and T[3]=W[3]

therefore, the above observation is

W[2]=W[0] and W[3]=W[1]
```
Generalizing this strategy, we can say the following. If Wi contains a prefix of Wi somewhere other than position 0, then we can advance T to match that part of W. As a graphic:
```  R= r0 r1 .. rn

Wi=  r0 r1 .. rn ...
Ti=  r0 r1 .. rn ...
also,
Wi= ...   r0 r1 .. rn ...
^--^ ^-----------^
|       |
|       --- skip comparing these
--- advance T by this amount
```
If we were to write it more formally:
```  Wi = R P  = Q R S    P,Q,R≠''  (3)
```
• We advance T by |Q|.
• Now T|R|=W|R| since we already did those comparisons before.
• Therefore, the new value for i is |R|.
Now, if no such R can be found, then it means that no prefix repeats in Wi. Therefore, W can not start within Ti. In this case, we would simply increment T by i and restart the search. So, from now on I will assume there is such a solution to (3) and that we know all solutions to it by some magic.

Before going further let's reason about S. If S is not empty, then W and T will look like this:

``` W = R0 R1 .. Rr P0 P1 .. Pp ...
W = Q0 Q1 .. Qq R0 R1 .. Rr S0 S1 .. Ss ...
T = Q0 Q1 .. Qq R0 R1 .. Rr S0 S1 .. Ss ...
```
After doing our updates to T and i:
``` W = R0 R1 .. Rr P0 P1 .. Pp ...
T = R0 R1 .. Rr S0 S1 .. Ss ...
^
|
--- i now points here
```
At this point, there are two possibilities:

S0 ≠ P0: For solutions ending up like this, the solution is no good because the update is guaranteed to cause a mismatch. Therefore, such solutions shall be discarded.

S0 = P0: For this case, there is another solution where

```  R' = R S0
P' = P[1..]
Q' = Q
S' = S[1..]

Wi = R  P  =  Q R S
= R P[0] P[1..] = Q R S[0] S[1..]
since P[0]=S[0]
= R S[0] P[1..] = Q R S[0] S[1..]
= R' P' = Q' R' S'
```
This means that, if we find a solution to (3) where S≠'', we can extend R to the right until S becomes empty. If extension doesn't work, then that solution was useless anyway.

These two cases result in the following fact.

If prefix R repeats within Wi, it has to reach the end of Wi in order to be useful. In other words, it also has to be a suffix of Wi.
Our partitioning has been simplified quite a bit now:
```   Wi = R P  = Q R       (4)
```
Let's consider the case where there are at least two solutions to (4) with R values A and B. WLOG, let |A|>|B|.

Since A and B both match the initial part of Wi, B is a prefix of A. Last parts of A and B both match the last part of Wi, therefore B is also a suffix of A.

Since A is longer than B, it starts earlier as a suffix in Wi. If we chose B as our preferred R-value, then we would miss an earlier opportunity to match W.

```  Wi= A PA = QA A
Wi= B PB = QB B
```
If we were to advance T by |QB|>|QA|, then we would skip over the initial part of A where it appears as a suffix. That is a legitimate potentially matching position because we know that T matches A. Therefore we mustn't use B. Doing this for all A,B pairs we realize that we must use only the longest R possible, for correctness.

## Computing R

The actual values of R don't need to be stored because each R is a prefix of W and its contents can be retrieved from there if its length is known. Therefore, I will store only the length of R for each i in an array called L. I will still talk about R[i], but it won't be in the final code.

Since R[i] is the longest prefix/suffix in Wi, it relates to the bytes W[0] thru W[i-1]. L[i]= |R[i]|. Let's look at some initial values of L and R.

• For i=0, Wi='', therefore L[0]:= 0.
• For i=1, Wi=w0 -a single byte- therefore no R exists and L[1]:= 0.
• For i=2, Wi=w0w1. If w0=w1, then L[2]:=|R[2]=w0|= 1. Otherwise L[2]:=0.
For i≥3, our computation will depend on R[i-1].

If R[i-1]='', then there is no prefix to which we can attach Wi-1 and make a solution for i. The only option is to start a new prefix with the value wi-1 and that can only happen if wi-1=w0. Otherwise, we have no solution.

```  if R[i-1]=''
if wi-1=w0
R[i]:= wi-1
L[i]:= 1
else
R[i]:= ''
L[i]:= 0
endif
endif
```
On the other hand, if R[i-1]≠'', then we can try to construct R[i] by appending wi-1 to R[i-1]:
```  R[i]= R[i-1] wi-1
```
We already know that R[i] is suffix of Wi because of the construction above and the fact that R[i-1] is a suffix of Wi-1. However, R[i] can be used only if it is also a prefix. So, wi-1 must match the first byte just after R within Wi. That position is given by |R[i-1]|=L[i-1].
```   if R[i-1]≠''
if wi-1=wL[i-1]
R[i]:= R[i-1] wi-1
L[i]:= |R[i]| = L[i-1]+1
else
it's complicated
endif
endif
```
After handling all these cases, we have arrived at the most interesting part of the algorithm. To summarize, we have a non-empty R for i-1, but it can't be extended by adding wi-1 to it.

We know that Ri is not longer than Ri-1 since we tried to construct such an Ri and failed. If Ri exists, then it must be an extension of a proper prefix of Ri-1. Let me put it this way. If

```  Ri = Z wi-1
```
then |Z|<|Ri-1| because we tried Z=Ri-1 and failed. Now, if |Z|<|Ri-1| then Z must be a proper prefix of Ri-1 since they both match the first part of W, that is:
```  Wi-1= Ri-1 Pi-1
Wi= Z wi-1 Pi
```
Now, let's write down our facts and see what we can do.
```Ri-1= Z Y
Ri= Z wi-1
Wi= Qi Ri
Wi= Qi Z wi-1  (replaced value of Ri)
Wi-1= Qi Z     (if we remove the last byte from Wi, we get Wi-1)
Wi-1= Qi-1 Ri-1 (partitioning for Wi-1)
If we equate the last two expressions,

Qi-1 Ri-1 = Qi Z
```
The last equation is a very important result. It tells us that Z is a suffix of Ri-1. Well, it was also a prefix of Ri-1. I wonder where I can find a substring Z of Ri-1, where Z is both a prefix and a suffix? Wait, I KNOW! We already calculated it. It's stored right there in
```  Z=  R[ |Ri-1| ]
```
If Z turns out to be the empty string, then we have no solution. Therefore we proceed with setting R[i]='' and so on.

Otherwise, now we use Z as a basis to construct R[i] by appending wi-1 to it. We know that the suffix condition holds because we just constructed it with the last byte. However, the prefix condition also has to hold:

```   Ri= Z wi-1
Wi= Ri Pi
Wi= Z wi-1 Pi
therefore,
wi-1 must equal w|Z|
```
Here, Z is guaranteed to be the best we can do because it was computed by the same algorithm.

The behaviour of the algorithm at this point is identical to the one we did before, while extending Ri-1 to get Ri. If the prefix condition fails, our only solution exists within the set of prefix/suffix substrings of Z and so on. Let's write down the whole code before we can simplify it:

```  if R[i-1]=''
if wi-1=w0
R[i]:= wi-1
L[i]:= 1
else
R[i]:= ''
L[i]:= 0
endif
endif
if R[i-1]≠''
if wi-1=wL[i-1]
R[i]:= R[i-1] wi-1
L[i]:= |R[i]| = L[i-1]+1
else
it's complicated
Z:= R[ |R[i-1]| ]
z:= |Z|= L[ |R[i-1]| ]= L[ L[i-1] ]
try_again:
if Z≠''
if wi-1=wz
R[i]:= Z wi-1
L[i]:= |R[i]|= |Z| + 1= z+1
else
Z:= R[z]
z:= |Z|= L[z]
goto try_again
endif
else
no solution, do it like at the beginning
R[i]:= ''
L[i]:= 0
endif
endif
endif
```
Now that we're finished with R, we can totally remove it from the code and use only L.
```  if L[i-1]=0
if wi-1=w0
L[i]:= 1
else
L[i]:= 0
endif
endif
if L[i-1]≠0
if wi-1=wL[i-1]
L[i]:= L[i-1]+1
else
it's complicated
z:= L[ L[i-1] ]
try_again:
if z≠0
if wi-1=wz
L[i]:= z+1
else
z:= L[z]
goto try_again
endif
else
no solution, do it like at the beginning
L[i]:= 0
endif
endif
endif
```
Since the outmost conditions are mutually exclusive, we can re-order them and put them into a single if-else.
```  if L[i-1]≠0
if wi-1=wL[i-1]
L[i]:= L[i-1]+1
else
it's complicated
z:= L[ L[i-1] ]
try_again:
if z≠0
if wi-1=wz
L[i]:= z+1
else
z:= L[z]
goto try_again
endif
else
no solution, do it like at the beginning
L[i]:= 0
endif
endif
else
if wi-1=w0
L[i]:= 1
else
L[i]:= 0
endif
endif
```
Now we can combine the two places where an extension is done by assigning i-1 to z initially.
```  z:= i-1
while L[z]≠0
if wi-1=wL[z]
L[i]:= L[z]+1
break while loop
else
z:= L[z]
endif
if L[z]=0
if wi-1=w0
L[i]:= 1
else
L[i]:= 0
endif
```
Prettifying it further:
```  z:= i-1
while L[z]≠0 and wi-1≠wL[z], z:= L[z]
if L[z]=0
L[i]:= wi-1=w0 ? 1:0
else
L[i]:= L[z]+1
endif
```
In the last part, we can change the if-condition to get rid of the conditional expression:
```  z:= i-1
while L[z]≠0 and wi-1≠wL[z], z:= L[z]
if wi-1=wL[z]
L[i]:= L[z] + 1
else
L[i]:= 0
endif
```

## Putting it Together

In the above section, we calculate R and L, but they aren't really necessary for the original algorithm. All we need is |Qi|.

When we advance T by |Qi|, we also need to decrement i by the same amount so that the comparison can continue beyond Ri.

```  Wi= Qi Ri
|Wi|= i = |Qi| + |Ri|
Q[i]= i - R[i], if R[i]≠0
Q[i]= 0       , if R[i]=0
```
For R[i]=0, I will use the sentinel value Q[i]=0. So now, I guess we're doing this:
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int *compute_Q(char *W)
{
int n,z,i,*L,*Q;
n= strlen(W);
L= calloc(n,sizeof(*L));
for(i=2;i<n;i++)
{
z= i-1;
while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
// calloc makes it zero otherwise
}
Q= L; // re-use the space
for(i=0;i<n;i++) Q[i]= L[i]? i-L[i] : 0;
return Q;
}

char *kmp(char *T, char *W,int *Q)
{
int i,n;
char *eT;
eT= T+strlen(T);
i= 0;
n= strlen(W);
if (T+n>eT) return NULL;
while(1)
{
if (T[i]==W[i]) { i++; if (i==n) return T; continue; }
if (!i) { T= strchr(T+1, W[0]); if (!T) return NULL; }
else if (Q[i]) { T+= Q[i]; i-= Q[i]; }
else { T+= i; i= 0; }
if (T+n>eT) return NULL;
}
}

void kmptest(char *T, char *W)
{
char *oT=T;
int *Q, n;
Q= compute_Q(W);
n= strlen(W);
printf("T: %s\nW: %s\n",T,W);
T-= n;
while ((T=kmp(T+n,W,Q))) printf("found at %d\n", T-oT);
free(Q);
}

struct {
char *T, *W;
} TT[]= {
{ "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
{ "AAAABAAAAABBBAAAAB", "AAAB", },
{ "hayhello", "hell" },
{ NULL, NULL }
};

int main()
{
int i;
for(i=0;TT[i].T;i++) kmptest(TT[i].T,TT[i].W);
return 0;
}
```
The above implementation isn't really O(|W|+|T|) but O(|T|2+|W|*|T|) because of the heavy usage of strlen. This can be fixed easily by passing n and eT around but this code is just a demo of the algorithm, not a finalized product.

As mentioned before, this code only finds non-overlapping matches to W in T. In order to find all matches, we extend W with a virtual character x such that x never compares equal to any other byte value. This way, a match is handled like a failure to match the extended W at the last byte (virtual character x). The match is reported and the usual updates to T and i are done to find the next (possibly overlapping) match.

I ended up not using this algorithm at all for my find/replace tool. In any case, I think this was a good exercise in understanding the KMP algorithm

## Fixing the Rest

OK, let's do a proper job of this. Here's the fixed version with overlapping matches enabled, with the correct number of strlen calls.
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
int (*report)(void *, char *, char *, size_t),void* arg)
{
char *eT, *oT;
int n,z,i,*L,*Q;

n= strlen(W);
if (!n) return 0;
L= calloc(n+1,sizeof(*L));
if (!L) return 1;
for(i=2;i<=n;i++)
{
z= i-1;
while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
}
Q= L;
for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

eT= T+strlen(T);
oT= T;
i= 0;
again:
if (T+n>eT) goto done;
for(;i<n;i++) if (T[i]!=W[i]) goto fail;

// match
if (report && (*report)(arg, oT, W, T-oT)) goto done;
if (Q[i]) { T+= Q[i]; i-= Q[i]; }
else { T+= i; i=0; }
goto again;
fail:
if (!i) { T++; goto again; }
done:
free(Q);
return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
printf("found at %d\n", (int) pos);
return 0;
}

int main()
{
static struct {
char *T, *W;
} TT[]= {
{ "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
{ "AAAABAAAAABBBAAAAB", "AAAB", },
{ "hayhello", "hell" },
{ NULL, NULL }
};
int i;
for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL);
return 0;
}
```
Since we know that i!=0 when there is a match (we checked for empty W at the beginning), we can safely move code part fail before advance.
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
int (*report)(void *, char *, char *, size_t),void* arg)
{
char *eT, *oT;
int n,z,i,*L,*Q;

n= strlen(W);
if (!n) return 0;
L= calloc(n+1,sizeof(*L));
if (!L) return 1;
for(i=2;i<=n;i++)
{
z= i-1;
while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
}
Q= L;
for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

eT= T+strlen(T);
oT= T;
i= 0;
again:
if (T+n>eT) goto done;
for(;i<n;i++) if (T[i]!=W[i]) goto fail;

// match
if (report && (*report)(arg, oT, W, T-oT)) goto done;
fail:
if (!i) { T++; goto again; }
if (Q[i]) { T+= Q[i]; i-= Q[i]; }
else { T+= i; i=0; }
goto again;
done:
free(Q);
return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
printf("found at %d\n", (int) pos);
return 0;
}

int main()
{
static struct {
char *T, *W;
} TT[]= {
{ "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
{ "AAAABAAAAABBBAAAAB", "AAAB", },
{ "hayhello", "hell" },
{ NULL, NULL }
};
int i;
for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL);
return 0;
}
```
Now the whole thing can be made into a loop, removing the labels.
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
int (*report)(void *, char *, char *, size_t),void* arg)
{
char *eT, *oT;
int n,z,i,*L,*Q;

n= strlen(W);
if (!n) return 0;
L= calloc(n+1,sizeof(*L));
if (!L) return 1;
for(i=2;i<=n;i++)
{
z= i-1;
while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
}
Q= L;
for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

eT= T+strlen(T);
oT= T;
i= 0;
while(T+n<=eT)
{
for(;i<n;i++) if (T[i]!=W[i]) break;
if (i==n && report && (*report)(arg, oT, W, T-oT)) break;
if (!i) { T++; }
else if (Q[i]) { T+= Q[i]; i-= Q[i]; }
else { T+= i; i=0; }
}
free(Q);
return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
printf("found at %d\n", (int) pos);
return 0;
}

int main()
{
static struct {
char *T, *W;
} TT[]= {
{ "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
{ "AAAABAAAAABBBAAAAB", "AAAB", },
{ "hayhello", "hell" },
{ NULL, NULL }
};
int i;
for(i=0;TT[i].T;i++)
{
printf("T: %s\nW: %s\n",TT[i].T,TT[i].W);
kmpall(TT[i].T,TT[i].W, prt, NULL);
}
return 0;
}
```
We could also make use of the fact that Q[0]==0 in order to simplify the if statements regarding i and Q[i] in the matching part:
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
int (*report)(void *, char *, char *, size_t),void* arg)
{
char *eT, *oT;
int n,z,i,*L,*Q;

n= strlen(W);
if (!n) return 0;
L= calloc(n+1,sizeof(*L));
if (!L) return 1;
for(i=2;i<=n;i++)
{
z= i-1;
while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
}
Q= L;
for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

eT= T+strlen(T);
oT= T;
i= 0;
while(T+n<=eT)
{
for(;i<n;i++) if (T[i]!=W[i]) break;
if (i==n && report && (*report)(arg, oT, W, T-oT)) break;
if (Q[i]) { T+= Q[i]; i-= Q[i]; }
else { T+= i ? i : 1; i=0; }
}
free(Q);
return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
printf("found at %d\n", (int) pos);
return 0;
}

int main()
{
static struct {
char *T, *W;
} TT[]= {
{ "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
{ "AAAABAAAAABBBAAAAB", "AAAB", },
{ "hayhello", "hell" },
{ NULL, NULL }
};
int i;
for(i=0;TT[i].T;i++)
{
printf("T: %s\nW: %s\n",TT[i].T,TT[i].W);
kmpall(TT[i].T,TT[i].W, prt, NULL);
}
return 0;
}
```