String Search

Well this is another case of "I was doing something else..". For small strings, strstr is enough but if you need to find a string in a file with a size of hundreds of megabytes, then you need some specialized tool, eh?

As any idiot will tell you, the first step to finding a string W in text T, is to find the first byte of W within T. That much we know.

uint8_t *find_str(uint8_t *T, size_t lenT, uint8_t *W, size_t lenW)
{
  uint8_t *F;
  ssize_t i;
restart_the_whole_thing:
  F= memchr(T, W[0], lenT);
  if (!F) return NULL;
  lenT -= F-T;
  if (lenT<lenW) return NULL;
  T= F;
Now, T[0]==W[0]. As long as T[i]==W[i], we're good. There is no way to skip this check, either.
  for(i=0;i<lenW;i++)
    if (W[i]!=T[i])
    {
fail:
      // DANGER!DANGER!
    }
   return T;
}
What you do in the case of a failure separates high school string search from PhD string search. The high school search goes like:
fail:
    T++; lenT--; goto restart_the_whole_thing; // yeah, right
For PhD search, there are several algorithms. The simplest-looking and most-sedgewick-like is Knuth-Morris-Pratt. There are a lot of people talking about this but I can't wrap my head around why the algorithm works.

Understanding it is crucial because I'm going to be using it in a context different than in-memory search. Specifically, I will use it to do search/replace in some huge files with some additional constraints.

Notation

In this page, I assume that T has been advanced so that T[0]==W[0]. Therefore, the algorithm I describe is a string matching algorithm after this increment. In my context, advancing T forward corresponds to copying some bytes from the source file to the output file.

All capital letter values below are strings or arrays. Indices start from 0. Negative indices are from the end of a string/array. i.e. W[-1] denotes the last byte of W. Wi denotes the first i bytes of W, i.e. W[0..i-1]. nW denotes the last n bytes. := is the assignment operator.

If W matches T, then it's always depicted at the end of T. I'm not interested in overlapping matches so the rest of the text isn't relevant. In the real implementation, T would be advanced by the length of the match and the search would restart.

Initial Thoughts

Let's assume that we have matched T[0..i-1] to W[0..i-1] so far, and that we have a mismatch at i.
    Ti = Wi          (1)
  T[i] ≠ W[i]        (2)
The trivial algorithm runs RESTART for all mismatches.
  RESTART: T+= 1, i= 0
  REPOS:   if T[∃minimal(k)]=W[0], T+=k,i=0 else fail
However, we could analyse W and predetermine if T should be incremented even more. For instance,
 i= 012345678
 T= ababababc
 W= ababc
Here, we would get a mismatch at i=4. Would it make sense to RESTART at this point? Of course not. We know that T[1]='b' and it won't match W[0]='a' anyway.

The main idea of the algorithm is to analyse W such that for each possible mismatch at i, T[i]≠W[i], determine how much T should be advanced and what value should be used for i from there on.

Now, let's think about how much T should be advanced. In order to advance T in a meaningful manner (i.e. not guaranteed to fail), we need to make sure that some bytes at the advanced position matches the first bytes of W. In the example above, advancing by 1 didn't work, but advancing by 2 would work:

   T[2]=W[0] and T[3]=W[1]
  
       due to (1),
   T[2]=W[2] and T[3]=W[3]

      therefore, the above observation is 
      really only about W:

   W[2]=W[0] and W[3]=W[1] 
Generalizing this strategy, we can say the following. If Wi contains a prefix of Wi somewhere other than position 0, then we can advance T to match that part of W. As a graphic:
  R= r0 r1 .. rn

  Wi=  r0 r1 .. rn ...
  Ti=  r0 r1 .. rn ...
    also,
  Wi= ...   r0 r1 .. rn ...
      ^--^ ^-----------^
        |       |        
        |       --- skip comparing these
        --- advance T by this amount
If we were to write it more formally:
  Wi = R P  = Q R S    P,Q,R≠''  (3)
Now, if no such R can be found, then it means that no prefix repeats in Wi. Therefore, W can not start within Ti. In this case, we would simply increment T by i and restart the search. So, from now on I will assume there is such a solution to (3) and that we know all solutions to it by some magic.

Before going further let's reason about S. If S is not empty, then W and T will look like this:

 W = R0 R1 .. Rr P0 P1 .. Pp ... 
 W = Q0 Q1 .. Qq R0 R1 .. Rr S0 S1 .. Ss ...
 T = Q0 Q1 .. Qq R0 R1 .. Rr S0 S1 .. Ss ...
After doing our updates to T and i:
 W = R0 R1 .. Rr P0 P1 .. Pp ... 
 T = R0 R1 .. Rr S0 S1 .. Ss ...
                 ^
                 |
                 --- i now points here
At this point, there are two possibilities:

S0 ≠ P0: For solutions ending up like this, the solution is no good because the update is guaranteed to cause a mismatch. Therefore, such solutions shall be discarded.

S0 = P0: For this case, there is another solution where

  R' = R S0
  P' = P[1..]
  Q' = Q
  S' = S[1..]

 Wi = R  P  =  Q R S
    = R P[0] P[1..] = Q R S[0] S[1..]
                 since P[0]=S[0]
    = R S[0] P[1..] = Q R S[0] S[1..]
    = R' P' = Q' R' S'
This means that, if we find a solution to (3) where S≠'', we can extend R to the right until S becomes empty. If extension doesn't work, then that solution was useless anyway.

These two cases result in the following fact.

If prefix R repeats within Wi, it has to reach the end of Wi in order to be useful. In other words, it also has to be a suffix of Wi.
Our partitioning has been simplified quite a bit now:
   Wi = R P  = Q R       (4)
Let's consider the case where there are at least two solutions to (4) with R values A and B. WLOG, let |A|>|B|.

Since A and B both match the initial part of Wi, B is a prefix of A. Last parts of A and B both match the last part of Wi, therefore B is also a suffix of A.

Since A is longer than B, it starts earlier as a suffix in Wi. If we chose B as our preferred R-value, then we would miss an earlier opportunity to match W.

  Wi= A PA = QA A 
  Wi= B PB = QB B
If we were to advance T by |QB|>|QA|, then we would skip over the initial part of A where it appears as a suffix. That is a legitimate potentially matching position because we know that T matches A. Therefore we mustn't use B. Doing this for all A,B pairs we realize that we must use only the longest R possible, for correctness.

Computing R

The actual values of R don't need to be stored because each R is a prefix of W and its contents can be retrieved from there if its length is known. Therefore, I will store only the length of R for each i in an array called L. I will still talk about R[i], but it won't be in the final code.

Since R[i] is the longest prefix/suffix in Wi, it relates to the bytes W[0] thru W[i-1]. L[i]= |R[i]|. Let's look at some initial values of L and R.

For i≥3, our computation will depend on R[i-1].

If R[i-1]='', then there is no prefix to which we can attach Wi-1 and make a solution for i. The only option is to start a new prefix with the value wi-1 and that can only happen if wi-1=w0. Otherwise, we have no solution.

  if R[i-1]=''
      if wi-1=w0
           R[i]:= wi-1
           L[i]:= 1
      else
           R[i]:= ''
           L[i]:= 0
      endif
  endif
On the other hand, if R[i-1]≠'', then we can try to construct R[i] by appending wi-1 to R[i-1]:
  R[i]= R[i-1] wi-1
We already know that R[i] is suffix of Wi because of the construction above and the fact that R[i-1] is a suffix of Wi-1. However, R[i] can be used only if it is also a prefix. So, wi-1 must match the first byte just after R within Wi. That position is given by |R[i-1]|=L[i-1].
   if R[i-1]≠''
        if wi-1=wL[i-1]
             R[i]:= R[i-1] wi-1
             L[i]:= |R[i]| = L[i-1]+1
        else
            it's complicated
        endif
   endif
After handling all these cases, we have arrived at the most interesting part of the algorithm. To summarize, we have a non-empty R for i-1, but it can't be extended by adding wi-1 to it.

We know that Ri is not longer than Ri-1 since we tried to construct such an Ri and failed. If Ri exists, then it must be an extension of a proper prefix of Ri-1. Let me put it this way. If

  Ri = Z wi-1
then |Z|<|Ri-1| because we tried Z=Ri-1 and failed. Now, if |Z|<|Ri-1| then Z must be a proper prefix of Ri-1 since they both match the first part of W, that is:
  Wi-1= Ri-1 Pi-1
    Wi= Z wi-1 Pi
Now, let's write down our facts and see what we can do.
Ri-1= Z Y
Ri= Z wi-1          
Wi= Qi Ri 
Wi= Qi Z wi-1  (replaced value of Ri)
Wi-1= Qi Z     (if we remove the last byte from Wi, we get Wi-1)
Wi-1= Qi-1 Ri-1 (partitioning for Wi-1)
    If we equate the last two expressions,

          Qi-1 Ri-1 = Qi Z
The last equation is a very important result. It tells us that Z is a suffix of Ri-1. Well, it was also a prefix of Ri-1. I wonder where I can find a substring Z of Ri-1, where Z is both a prefix and a suffix? Wait, I KNOW! We already calculated it. It's stored right there in
  Z=  R[ |Ri-1| ]
If Z turns out to be the empty string, then we have no solution. Therefore we proceed with setting R[i]='' and so on.

Otherwise, now we use Z as a basis to construct R[i] by appending wi-1 to it. We know that the suffix condition holds because we just constructed it with the last byte. However, the prefix condition also has to hold:

   Ri= Z wi-1
   Wi= Ri Pi
   Wi= Z wi-1 Pi
       therefore,
   wi-1 must equal w|Z|
Here, Z is guaranteed to be the best we can do because it was computed by the same algorithm.

The behaviour of the algorithm at this point is identical to the one we did before, while extending Ri-1 to get Ri. If the prefix condition fails, our only solution exists within the set of prefix/suffix substrings of Z and so on. Let's write down the whole code before we can simplify it:

  if R[i-1]=''
      if wi-1=w0
           R[i]:= wi-1
           L[i]:= 1
      else
           R[i]:= ''
           L[i]:= 0
      endif
  endif
   if R[i-1]≠''
        if wi-1=wL[i-1]
             R[i]:= R[i-1] wi-1
             L[i]:= |R[i]| = L[i-1]+1
        else
            it's complicated
            Z:= R[ |R[i-1]| ]
            z:= |Z|= L[ |R[i-1]| ]= L[ L[i-1] ]
          try_again:
            if Z≠''
               if wi-1=wz
                  R[i]:= Z wi-1
                  L[i]:= |R[i]|= |Z| + 1= z+1
               else
                  Z:= R[z]
                  z:= |Z|= L[z]
                  goto try_again
               endif
            else
               no solution, do it like at the beginning
               R[i]:= ''
               L[i]:= 0
            endif 
        endif
   endif
Now that we're finished with R, we can totally remove it from the code and use only L.
  if L[i-1]=0
      if wi-1=w0
           L[i]:= 1
      else
           L[i]:= 0
      endif
  endif
   if L[i-1]≠0
        if wi-1=wL[i-1]
             L[i]:= L[i-1]+1
        else
            it's complicated
            z:= L[ L[i-1] ]
          try_again:
            if z≠0
               if wi-1=wz
                  L[i]:= z+1
               else
                  z:= L[z]
                  goto try_again
               endif
            else
               no solution, do it like at the beginning
               L[i]:= 0
            endif 
        endif
   endif
Since the outmost conditions are mutually exclusive, we can re-order them and put them into a single if-else.
  if L[i-1]≠0
        if wi-1=wL[i-1]
             L[i]:= L[i-1]+1
        else
            it's complicated
            z:= L[ L[i-1] ]
          try_again:
            if z≠0
               if wi-1=wz
                  L[i]:= z+1
               else
                  z:= L[z]
                  goto try_again
               endif
            else
               no solution, do it like at the beginning
               L[i]:= 0
            endif 
        endif
  else
      if wi-1=w0
           L[i]:= 1
      else
           L[i]:= 0
      endif
  endif
Now we can combine the two places where an extension is done by assigning i-1 to z initially.
  z:= i-1
  while L[z]≠0
    if wi-1=wL[z]
         L[i]:= L[z]+1
         break while loop
    else
         z:= L[z]
    endif
  if L[z]=0
     if wi-1=w0
          L[i]:= 1
     else
          L[i]:= 0
  endif
Prettifying it further:
  z:= i-1
  while L[z]≠0 and wi-1≠wL[z], z:= L[z]
  if L[z]=0
     L[i]:= wi-1=w0 ? 1:0
  else
     L[i]:= L[z]+1
  endif
In the last part, we can change the if-condition to get rid of the conditional expression:
  z:= i-1
  while L[z]≠0 and wi-1≠wL[z], z:= L[z]
  if wi-1=wL[z]
     L[i]:= L[z] + 1
  else
     L[i]:= 0
  endif

Putting it Together

In the above section, we calculate R and L, but they aren't really necessary for the original algorithm. All we need is |Qi|.

When we advance T by |Qi|, we also need to decrement i by the same amount so that the comparison can continue beyond Ri.

  Wi= Qi Ri 
  |Wi|= i = |Qi| + |Ri|
  Q[i]= i - R[i], if R[i]≠0
  Q[i]= 0       , if R[i]=0
For R[i]=0, I will use the sentinel value Q[i]=0. So now, I guess we're doing this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int *compute_Q(char *W)
{
  int n,z,i,*L,*Q;
  n= strlen(W);
  L= calloc(n,sizeof(*L));
  for(i=2;i<n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
    // calloc makes it zero otherwise
  } 
  Q= L; // re-use the space
  for(i=0;i<n;i++) Q[i]= L[i]? i-L[i] : 0;
  return Q;
}


char *kmp(char *T, char *W,int *Q)
{
  int i,n;
  char *eT;
  eT= T+strlen(T);
  i= 0;
  n= strlen(W);
  if (T+n>eT) return NULL;
  while(1)
  {
    if (T[i]==W[i]) { i++; if (i==n) return T; continue; }
    if (!i) { T= strchr(T+1, W[0]); if (!T) return NULL; }
       else if (Q[i]) { T+= Q[i]; i-= Q[i]; }
                 else { T+= i; i= 0; }
    if (T+n>eT) return NULL;
  }
}

void kmptest(char *T, char *W)
{
  char *oT=T;
  int *Q, n;
  Q= compute_Q(W);
  n= strlen(W);
  printf("T: %s\nW: %s\n",T,W);
  T-= n;
  while ((T=kmp(T+n,W,Q))) printf("found at %d\n", T-oT); 
  free(Q);
}

struct {
  char *T, *W;
} TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
};

int main()
{
  int i;
  for(i=0;TT[i].T;i++) kmptest(TT[i].T,TT[i].W);
  return 0;
}
The above implementation isn't really O(|W|+|T|) but O(|T|2+|W|*|T|) because of the heavy usage of strlen. This can be fixed easily by passing n and eT around but this code is just a demo of the algorithm, not a finalized product.

As mentioned before, this code only finds non-overlapping matches to W in T. In order to find all matches, we extend W with a virtual character x such that x never compares equal to any other byte value. This way, a match is handled like a failure to match the extended W at the last byte (virtual character x). The match is reported and the usual updates to T and i are done to find the next (possibly overlapping) match.

I ended up not using this algorithm at all for my find/replace tool. In any case, I think this was a good exercise in understanding the KMP algorithm

Fixing the Rest

OK, let's do a proper job of this. Here's the fixed version with overlapping matches enabled, with the correct number of strlen calls.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
again:
  if (T+n>eT) goto done;
  for(;i<n;i++) if (T[i]!=W[i]) goto fail;

  // match
  if (report && (*report)(arg, oT, W, T-oT)) goto done;
advance:
  if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
       else { T+= i; i=0; }
  goto again;
fail:
  if (!i) { T++; goto again; }
  goto advance;
done:
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL);
  return 0;
}
Since we know that i!=0 when there is a match (we checked for empty W at the beginning), we can safely move code part fail before advance.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
again:
  if (T+n>eT) goto done;
  for(;i<n;i++) if (T[i]!=W[i]) goto fail;

  // match
  if (report && (*report)(arg, oT, W, T-oT)) goto done;
fail:
  if (!i) { T++; goto again; }
  goto advance;
advance:
  if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
       else { T+= i; i=0; }
  goto again;
done:
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL);
  return 0;
}
Now the whole thing can be made into a loop, removing the labels.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
  while(T+n<=eT)
  {
    for(;i<n;i++) if (T[i]!=W[i]) break;
    if (i==n && report && (*report)(arg, oT, W, T-oT)) break;
    if (!i) { T++; } 
    else if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
       else { T+= i; i=0; }
  }
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) 
  {
    printf("T: %s\nW: %s\n",TT[i].T,TT[i].W);
    kmpall(TT[i].T,TT[i].W, prt, NULL);
  }
  return 0;
}
We could also make use of the fact that Q[0]==0 in order to simplify the if statements regarding i and Q[i] in the matching part:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
  while(T+n<=eT)
  {
    for(;i<n;i++) if (T[i]!=W[i]) break;
    if (i==n && report && (*report)(arg, oT, W, T-oT)) break;
    if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
         else { T+= i ? i : 1; i=0; }
  }
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) 
  {
    printf("T: %s\nW: %s\n",TT[i].T,TT[i].W);
    kmpall(TT[i].T,TT[i].W, prt, NULL);
  }
  return 0;
}

Links

Here are links to some inferior pages talking about this :P