Newsgroups: comp.theory
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!ira.uka.de!rusmv1!rusmv1!mark
From: mark@adler.philosophie.uni-stuttgart.de (Mark Johnson)
Subject: Re: Longest repeated substring problem.
In-Reply-To: ajay@cs.Buffalo.EDU's message of 15 Apr 91 18:08:55 GMT
Message-ID: <MARK.91Apr16123339@adler.philosophie.uni-stuttgart.de>
Sender: news@rusmv1.rus.uni-stuttgart.de (USENET News System)
Organization: IMSV, University of Stuttgart, Germany
References: <70943@eerie.acsu.Buffalo.EDU>
Date: 16 Apr 91 12:33:39
Lines: 81


Algorithm for finding longest common substring (I don't know if
this is new or not).

Input: An array A (of characters) of length n.

Output: Integers p, q, and l such that
            A[p] .. A[p+l] = A[q] .. A[q+l]
  and there do not exist p', q' and l' > l such that
            A[p'] .. A[p'+l'] = A[q'] .. A[q'+l']

  Its running time is O(l n lg n).
  

The algorithm rests on the observation that the required
substrings of A have the property that they are adjacent
in a sorted sequence of all substrings of A.  This algorithm
uses an idea of Martin Kay's for space- and time-efficiently 
sorting all of the substrings of a string.

The algorithm uses two arrays of integers P and L, both of
length n.

We use P as an array of indices into the array A,
whose value is the suffix of A beginning at P[i].

1. For 0 <= i < n set P[i] = i.

     P now contains indices to every suffix in A.

2. Sort the array P, where P[i] < P[j] iff the suffix
   of A beginning at P[i] precedes the suffix beginning
   at P[j] is standard lexicographic order.

      P now contains indices to every suffix in A
      in lexicographic order.

3. For each 0 <= i < n-1, let L[i] be the length of
   the common prefix of the suffixes of A beginning
   at P[i] and P[i+1].

4. Find a maximum m such that for all 0 <= i < n-1,
   L[m] >= L[i].  Set p = P[m], q = P[m+1], l = L[m]
   and stop.

We estimate the time complexity as follows.

Step 1 takes O(n) time.
Step 2 takes O(l n lg n) (since each comparision
   may need to compare at most l characters)
Step 3 takes O(l n) time.
Step 4 takes O(n) time.

Comments?

Mark Johnson
--

Mark Johnson
Institut fuer maschinelle Sprachverarbeitung - Computerlinguistik
Universitaet Stuttgart
Keplerstrasse 17
D-7000 Stuttgart 1
Germany.

phone: 0711 - 121 3132.

On leave until mid July 1991 from:

Cognitive and Linguistic Sciences, Box 1978
Brown University
Providence, RI 02912
USA

email addresses: 

mark@adler.philosophie.uni-stuttgart.de
mj@cs.brown.edu
markj@ai.mit.edu
johnson@csli.stanford.edu
mark@ego.mit.edu
