String Search

Well this is another case of "I was doing something else..". For small strings, strstr is enough but if you need to find a string in a file with a size of hundreds of megabytes, then you need some specialized tool, eh?

As any idiot will tell you, the first step to finding a string W in text T, is to find the first byte of W within T. That much we know.

uint8_t *find_str(uint8_t *T, size_t lenT, uint8_t *W, size_t lenW)
{
  uint8_t *F;
  ssize_t i;
restart_the_whole_thing:
  F= memchr(T, W[0], lenT);
  if (!F) return NULL;
  lenT -= F-T;
  if (lenT<lenW) return NULL;
  T= F;

Now, T[0]==W[0]. As long as T[i]==W[i], we're good. There is no way to skip this check, either.

  for(i=0;i<lenW;i++)
    if (W[i]!=T[i])
    {
fail:
      // DANGER!DANGER!
    }
   return T;
}

What you do in the case of a failure separates high school string search from PhD string search. The high school search goes like:

fail:
    T++; lenT--; goto restart_the_whole_thing; // yeah, right

For PhD search, there are several algorithms. The simplest-looking and most-sedgewick-like is Knuth-Morris-Pratt. There are a lot of people talking about this but I can't wrap my head around why the algorithm works.

Understanding it is crucial because I'm going to be using it in a context different than in-memory search. Specifically, I will use it to do search/replace in some huge files with some additional constraints.

Notation

In this page, I assume that T has been advanced so that T[0]==W[0]. Therefore, the algorithm I describe is a string matching algorithm after this increment. In my context, advancing T forward corresponds to copying some bytes from the source file to the output file.

All capital letter values below are strings or arrays. Indices start from 0. Negative indices are from the end of a string/array. i.e. W[-1] denotes the last byte of W. Wⁱ denotes the first i bytes of W, i.e. W[0..i-1]. ⁿW denotes the last n bytes. := is the assignment operator.

If W matches T, then it's always depicted at the end of T. I'm not interested in overlapping matches so the rest of the text isn't relevant. In the real implementation, T would be advanced by the length of the match and the search would restart.

Initial Thoughts

Let's assume that we have matched T[0..i-1] to W[0..i-1] so far, and that we have a mismatch at i.

    Tⁱ = Wⁱ          (1)
  T[i] ≠ W[i]        (2)

The trivial algorithm runs RESTART for all mismatches.

  RESTART: T+= 1, i= 0
  REPOS:   if T[∃minimal(k)]=W[0], T+=k,i=0 else fail

However, we could analyse W and predetermine if T should be incremented even more. For instance,

 i= 012345678
 T= ababababc
 W= ababc

Here, we would get a mismatch at i=4. Would it make sense to RESTART at this point? Of course not. We know that T[1]='b' and it won't match W[0]='a' anyway.

The main idea of the algorithm is to analyse W such that for each possible mismatch at i, T[i]≠W[i], determine how much T should be advanced and what value should be used for i from there on.

Now, let's think about how much T should be advanced. In order to advance T in a meaningful manner (i.e. not guaranteed to fail), we need to make sure that some bytes at the advanced position matches the first bytes of W. In the example above, advancing by 1 didn't work, but advancing by 2 would work:

   T[2]=W[0] and T[3]=W[1]
  
       due to (1),
   T[2]=W[2] and T[3]=W[3]

      therefore, the above observation is 
      really only about W:

   W[2]=W[0] and W[3]=W[1]

Generalizing this strategy, we can say the following. If Wⁱ contains a prefix of Wⁱ somewhere other than position 0, then we can advance T to match that part of W. As a graphic:

  R= r₀ r₁ .. r_n

  Wⁱ=  r₀ r₁ .. r_n ...
  Tⁱ=  r₀ r₁ .. r_n ...
    also,
  Wⁱ= ...   r₀ r₁ .. r_n ...
      ^--^ ^-----------^
        |       |        
        |       --- skip comparing these
        --- advance T by this amount

If we were to write it more formally:

  Wⁱ = R P  = Q R S    P,Q,R≠''  (3)

We advance T by |Q|.
Now T^|R|=W^|R| since we already did those comparisons before.
Therefore, the new value for i is |R|.

Now, if no such R can be found, then it means that no prefix repeats in Wⁱ. Therefore, W can not start within Tⁱ. In this case, we would simply increment T by i and restart the search. So, from now on I will assume there is such a solution to (3) and that we know all solutions to it by some magic.

Before going further let's reason about S. If S is not empty, then W and T will look like this:

 W = R₀ R₁ .. R_r P₀ P₁ .. P_p ... 
 W = Q₀ Q₁ .. Q_q R₀ R₁ .. R_r S₀ S₁ .. S_s ...
 T = Q₀ Q₁ .. Q_q R₀ R₁ .. R_r S₀ S₁ .. S_s ...

After doing our updates to T and i:

 W = R₀ R₁ .. R_r P₀ P₁ .. P_p ... 
 T = R₀ R₁ .. R_r S₀ S₁ .. S_s ...
                 ^
                 |
                 --- i now points here

At this point, there are two possibilities:

S₀ ≠ P₀: For solutions ending up like this, the solution is no good because the update is guaranteed to cause a mismatch. Therefore, such solutions shall be discarded.

S₀ = P₀: For this case, there is another solution where

  R' = R S₀
  P' = P[1..]
  Q' = Q
  S' = S[1..]

 Wⁱ = R  P  =  Q R S
    = R P[0] P[1..] = Q R S[0] S[1..]
                 since P[0]=S[0]
    = R S[0] P[1..] = Q R S[0] S[1..]
    = R' P' = Q' R' S'

This means that, if we find a solution to (3) where S≠'', we can extend R to the right until S becomes empty. If extension doesn't work, then that solution was useless anyway.

These two cases result in the following fact.

If prefix R repeats within Wⁱ, it has to reach the end of Wⁱ in order to be useful. In other words, it also has to be a suffix of Wⁱ.

Our partitioning has been simplified quite a bit now:

   Wⁱ = R P  = Q R       (4)

Let's consider the case where there are at least two solutions to (4) with R values A and B. WLOG, let |A|>|B|.

Since A and B both match the initial part of Wⁱ, B is a prefix of A. Last parts of A and B both match the last part of Wⁱ, therefore B is also a suffix of A.

Since A is longer than B, it starts earlier as a suffix in Wⁱ. If we chose B as our preferred R-value, then we would miss an earlier opportunity to match W.

  Wⁱ= A P_A = Q_A A 
  Wⁱ= B P_B = Q_B B

If we were to advance T by |Q_B|>|Q_A|, then we would skip over the initial part of A where it appears as a suffix. That is a legitimate potentially matching position because we know that T matches A. Therefore we mustn't use B. Doing this for all A,B pairs we realize that we must use only the longest R possible, for correctness.

Computing R

The actual values of R don't need to be stored because each R is a prefix of W and its contents can be retrieved from there if its length is known. Therefore, I will store only the length of R for each i in an array called L. I will still talk about R[i], but it won't be in the final code.

Since R[i] is the longest prefix/suffix in Wⁱ, it relates to the bytes W[0] thru W[i-1]. L[i]= |R[i]|. Let's look at some initial values of L and R.

For i=0, Wⁱ='', therefore L[0]:= 0.
For i=1, Wⁱ=w₀ -a single byte- therefore no R exists and L[1]:= 0.
For i=2, Wⁱ=w₀w₁. If w₀=w₁, then L[2]:=|R[2]=w₀|= 1. Otherwise L[2]:=0.

For i≥3, our computation will depend on R[i-1].

If R[i-1]='', then there is no prefix to which we can attach W_i-1 and make a solution for i. The only option is to start a new prefix with the value w_i-1 and that can only happen if w_i-1=w₀. Otherwise, we have no solution.

  if R[i-1]=''
      if w_i-1=w₀
           R[i]:= w_i-1
           L[i]:= 1
      else
           R[i]:= ''
           L[i]:= 0
      endif
  endif

On the other hand, if R[i-1]≠'', then we can try to construct R[i] by appending w_i-1 to R[i-1]:

  R[i]= R[i-1] w_i-1

We already know that R[i] is suffix of Wⁱ because of the construction above and the fact that R[i-1] is a suffix of W^i-1. However, R[i] can be used only if it is also a prefix. So, w_i-1 must match the first byte just after R within Wⁱ. That position is given by |R[i-1]|=L[i-1].

   if R[i-1]≠''
        if w_i-1=w_L[i-1]
             R[i]:= R[i-1] w_i-1
             L[i]:= |R[i]| = L[i-1]+1
        else
            it's complicated
        endif
   endif

After handling all these cases, we have arrived at the most interesting part of the algorithm. To summarize, we have a non-empty R for i-1, but it can't be extended by adding w_i-1 to it.

We know that R_i is not longer than R_i-1 since we tried to construct such an R_i and failed. If R_i exists, then it must be an extension of a proper prefix of R_i-1. Let me put it this way. If

  R_i = Z w_i-1

then |Z|<|R_i-1| because we tried Z=R_i-1 and failed. Now, if |Z|<|R_i-1| then Z must be a proper prefix of R_i-1 since they both match the first part of W, that is:

  W^i-1= R_i-1 P_i-1
    Wⁱ= Z w_i-1 P_i

Now, let's write down our facts and see what we can do.

R_i-1= Z Y
R_i= Z w_i-1          
Wⁱ= Q_i R_i 
Wⁱ= Q_i Z w_i-1  (replaced value of R_i)
W^i-1= Q_i Z     (if we remove the last byte from Wⁱ, we get W^i-1)
W^i-1= Q_i-1 R_i-1 (partitioning for W^i-1)
    If we equate the last two expressions,

          Q_i-1 R_i-1 = Q_i Z

The last equation is a very important result. It tells us that Z is a suffix of R_i-1. Well, it was also a prefix of R_i-1. I wonder where I can find a substring Z of R_i-1, where Z is both a prefix and a suffix? Wait, I KNOW! We already calculated it. It's stored right there in

  Z=  R[ |R_i-1| ]

If Z turns out to be the empty string, then we have no solution. Therefore we proceed with setting R[i]='' and so on.

Otherwise, now we use Z as a basis to construct R[i] by appending w_i-1 to it. We know that the suffix condition holds because we just constructed it with the last byte. However, the prefix condition also has to hold:

   R_i= Z w_i-1
   Wⁱ= R_i P_i
   Wⁱ= Z w_i-1 P_i
       therefore,
   w_i-1 must equal w_|Z|

Here, Z is guaranteed to be the best we can do because it was computed by the same algorithm.

The behaviour of the algorithm at this point is identical to the one we did before, while extending R_i-1 to get R_i. If the prefix condition fails, our only solution exists within the set of prefix/suffix substrings of Z and so on. Let's write down the whole code before we can simplify it:

  if R[i-1]=''
      if w_i-1=w₀
           R[i]:= w_i-1
           L[i]:= 1
      else
           R[i]:= ''
           L[i]:= 0
      endif
  endif
   if R[i-1]≠''
        if w_i-1=w_L[i-1]
             R[i]:= R[i-1] w_i-1
             L[i]:= |R[i]| = L[i-1]+1
        else
            it's complicated
            Z:= R[ |R[i-1]| ]
            z:= |Z|= L[ |R[i-1]| ]= L[ L[i-1] ]
          try_again:
            if Z≠''
               if w_i-1=w_z
                  R[i]:= Z w_i-1
                  L[i]:= |R[i]|= |Z| + 1= z+1
               else
                  Z:= R[z]
                  z:= |Z|= L[z]
                  goto try_again
               endif
            else
               no solution, do it like at the beginning
               R[i]:= ''
               L[i]:= 0
            endif 
        endif
   endif

Now that we're finished with R, we can totally remove it from the code and use only L.

  if L[i-1]=0
      if w_i-1=w₀
           L[i]:= 1
      else
           L[i]:= 0
      endif
  endif
   if L[i-1]≠0
        if w_i-1=w_L[i-1]
             L[i]:= L[i-1]+1
        else
            it's complicated
            z:= L[ L[i-1] ]
          try_again:
            if z≠0
               if w_i-1=w_z
                  L[i]:= z+1
               else
                  z:= L[z]
                  goto try_again
               endif
            else
               no solution, do it like at the beginning
               L[i]:= 0
            endif 
        endif
   endif

Since the outmost conditions are mutually exclusive, we can re-order them and put them into a single if-else.

  if L[i-1]≠0
        if w_i-1=w_L[i-1]
             L[i]:= L[i-1]+1
        else
            it's complicated
            z:= L[ L[i-1] ]
          try_again:
            if z≠0
               if w_i-1=w_z
                  L[i]:= z+1
               else
                  z:= L[z]
                  goto try_again
               endif
            else
               no solution, do it like at the beginning
               L[i]:= 0
            endif 
        endif
  else
      if w_i-1=w₀
           L[i]:= 1
      else
           L[i]:= 0
      endif
  endif

Now we can combine the two places where an extension is done by assigning i-1 to z initially.

  z:= i-1
  while L[z]≠0
    if w_i-1=w_L[z]
         L[i]:= L[z]+1
         break while loop
    else
         z:= L[z]
    endif
  if L[z]=0
     if w_i-1=w₀
          L[i]:= 1
     else
          L[i]:= 0
  endif

Prettifying it further:

  z:= i-1
  while L[z]≠0 and w_i-1≠w_L[z], z:= L[z]
  if L[z]=0
     L[i]:= w_i-1=w₀ ? 1:0
  else
     L[i]:= L[z]+1
  endif

In the last part, we can change the if-condition to get rid of the conditional expression:

  z:= i-1
  while L[z]≠0 and w_i-1≠w_L[z], z:= L[z]
  if w_i-1=w_L[z]
     L[i]:= L[z] + 1
  else
     L[i]:= 0
  endif

Putting it Together

In the above section, we calculate R and L, but they aren't really necessary for the original algorithm. All we need is |Q_i|.

When we advance T by |Q_i|, we also need to decrement i by the same amount so that the comparison can continue beyond R_i.

  Wⁱ= Q_i R_i 
  |Wⁱ|= i = |Q_i| + |R_i|
  Q[i]= i - R[i], if R[i]≠0
  Q[i]= 0       , if R[i]=0

For R[i]=0, I will use the sentinel value Q[i]=0. So now, I guess we're doing this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int *compute_Q(char *W)
{
  int n,z,i,*L,*Q;
  n= strlen(W);
  L= calloc(n,sizeof(*L));
  for(i=2;i<n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
    // calloc makes it zero otherwise
  } 
  Q= L; // re-use the space
  for(i=0;i<n;i++) Q[i]= L[i]? i-L[i] : 0;
  return Q;
}


char *kmp(char *T, char *W,int *Q)
{
  int i,n;
  char *eT;
  eT= T+strlen(T);
  i= 0;
  n= strlen(W);
  if (T+n>eT) return NULL;
  while(1)
  {
    if (T[i]==W[i]) { i++; if (i==n) return T; continue; }
    if (!i) { T= strchr(T+1, W[0]); if (!T) return NULL; }
       else if (Q[i]) { T+= Q[i]; i-= Q[i]; }
                 else { T+= i; i= 0; }
    if (T+n>eT) return NULL;
  }
}

void kmptest(char *T, char *W)
{
  char *oT=T;
  int *Q, n;
  Q= compute_Q(W);
  n= strlen(W);
  printf("T: %s\nW: %s\n",T,W);
  T-= n;
  while ((T=kmp(T+n,W,Q))) printf("found at %d\n", T-oT); 
  free(Q);
}

struct {
  char *T, *W;
} TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
};

int main()
{
  int i;
  for(i=0;TT[i].T;i++) kmptest(TT[i].T,TT[i].W);
  return 0;
}

The above implementation isn't really O(|W|+|T|) but O(|T|²+|W|*|T|) because of the heavy usage of strlen. This can be fixed easily by passing n and eT around but this code is just a demo of the algorithm, not a finalized product.

As mentioned before, this code only finds non-overlapping matches to W in T. In order to find all matches, we extend W with a virtual character x such that x never compares equal to any other byte value. This way, a match is handled like a failure to match the extended W at the last byte (virtual character x). The match is reported and the usual updates to T and i are done to find the next (possibly overlapping) match.

I ended up not using this algorithm at all for my find/replace tool. In any case, I think this was a good exercise in understanding the KMP algorithm

Fixing the Rest

OK, let's do a proper job of this. Here's the fixed version with overlapping matches enabled, with the correct number of strlen calls.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
again:
  if (T+n>eT) goto done;
  for(;i<n;i++) if (T[i]!=W[i]) goto fail;

  // match
  if (report && (*report)(arg, oT, W, T-oT)) goto done;
advance:
  if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
       else { T+= i; i=0; }
  goto again;
fail:
  if (!i) { T++; goto again; }
  goto advance;
done:
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL);
  return 0;
}

Since we know that i!=0 when there is a match (we checked for empty W at the beginning), we can safely move code part fail before advance.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
again:
  if (T+n>eT) goto done;
  for(;i<n;i++) if (T[i]!=W[i]) goto fail;

  // match
  if (report && (*report)(arg, oT, W, T-oT)) goto done;
fail:
  if (!i) { T++; goto again; }
  goto advance;
advance:
  if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
       else { T+= i; i=0; }
  goto again;
done:
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) kmpall(TT[i].T,TT[i].W, prt, NULL);
  return 0;
}

Now the whole thing can be made into a loop, removing the labels.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
  while(T+n<=eT)
  {
    for(;i<n;i++) if (T[i]!=W[i]) break;
    if (i==n && report && (*report)(arg, oT, W, T-oT)) break;
    if (!i) { T++; } 
    else if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
       else { T+= i; i=0; }
  }
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) 
  {
    printf("T: %s\nW: %s\n",TT[i].T,TT[i].W);
    kmpall(TT[i].T,TT[i].W, prt, NULL);
  }
  return 0;
}

We could also make use of the fact that Q[0]==0 in order to simplify the if statements regarding i and Q[i] in the matching part:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int kmpall(char *T, char *W,
          int (*report)(void *, char *, char *, size_t),void* arg)
{
  char *eT, *oT;
  int n,z,i,*L,*Q;

  n= strlen(W);
  if (!n) return 0;
  L= calloc(n+1,sizeof(*L));
  if (!L) return 1;
  for(i=2;i<=n;i++)
  {
    z= i-1;
    while(L[z]!=0 && W[i-1]!=W[L[z]]) z= L[z];
    if (W[i-1]==W[L[z]]) L[i]= L[z]+1;
  } 
  Q= L; 
  for(i=0;i<=n;i++) Q[i]= L[i]? i-L[i] : 0;

  eT= T+strlen(T);
  oT= T;
  i= 0;
  while(T+n<=eT)
  {
    for(;i<n;i++) if (T[i]!=W[i]) break;
    if (i==n && report && (*report)(arg, oT, W, T-oT)) break;
    if (Q[i]) { T+= Q[i]; i-= Q[i]; } 
         else { T+= i ? i : 1; i=0; }
  }
  free(Q);
  return 0;
}

int prt(void *arg, char *T, char *W, size_t pos)
{
  printf("found at %d\n", (int) pos);
  return 0;
}

int main()
{
  static struct {
     char *T, *W;
  } TT[]= {
  { "baabbbaabbaabbbabaabbbaabaabababba",  "baababa" },
  { "AAAABAAAAABBBAAAAB", "AAAB", },
  { "hayhello", "hell" },
  { NULL, NULL }
  };
  int i;
  for(i=0;TT[i].T;i++) 
  {
    printf("T: %s\nW: %s\n",TT[i].T,TT[i].W);
    kmpall(TT[i].T,TT[i].W, prt, NULL);
  }
  return 0;
}

Links

Here are links to some inferior pages talking about this :P